(54) Rate-controlled multi-class high-capacity packet switch



(57) A high-capacity switch for transferring variable-sized
packets under rate control is described. The packets
are divided in the switch into segments of predetermined
equal size. The packets are reconstructed before
egress from the switch. The switch serves traffic of
different classes of service, but the class of service
distinction is relevant only at the ingress (32) or egress (36)
modules. The switch control is preferably centralized to
facilitate effective sharing of the inner capacity of the
switch. The control is based on modulating the
ingress/egress rate according to traffic load, the central
control being unaware of the class of service disposition
of the traffic it controls. The advantage is a high capacity
switch adapted to transfer variable-sized packets with
guaranteed rate control.
Description
TECHNICAL FIELD

[0001] This invention relates generally to the field of
data packet switching and, in particular, to a rate-con- 
trolled very high-capacity switching node adapted to 
transfer data packets of variable size. 

BACKGROUND OF THE INVENTION 

[0002] There is a need for data networks adapted to 
transfer data in packets of varying and arbitrary size. A 
principal design objective for switching nodes in such 
networks is to realize a switch of scalable capacity and 
controlled service quality. A popular approach to achiev- 
ing this objective is to build networks based on the 
Asynchronous Transfer Mode (ATM) protocol. ATM pro- 
tocol has, in fact, succeeded in facilitating the construc- 
tion of very-high capacity switches and in providing 
effective means of service-quality control by enabling 
the enforcement of data transfer rate control. ATM, how- 
ever, is only adapted to switch packets, referred to as 
cells, of a 53-byte fixed cell length. Switching cells of
fixed size rather than packets of variable size simplifies 
the design of switches. 

[0003] ATM switches are cell-synchronous switches 
and are somewhat simpler to build and grow than a 
switch adapted to switch variable-sized packets. How- 
ever, there is a disadvantage in using a network that 
operates under a protocol which accommodates only 
fixed-size cells. When variable-sized packets are
transferred through such a network, the packets must be
deconstructed at a network edge device and packed 
into an appropriate number of cells. In the process of 
packing the contents of the packets into cells, a propor- 
tion of the cells is underutilized and must be padded 
with null data. Consequently, a proportion of the trans- 
port capacity of the network is wasted because of the 
partially filled cells. Another disadvantage of transfer- 
ring variable-sized packets in cells is that when a single 
cell is lost from a packet, the entire packet must be dis- 
carded but this is undetectable until all the remaining 
cells reach a destination edge of the network. When a 
cell belonging to a given packet is lost, the remaining 
cells are unknowingly transferred on to the destination 
edge of the ATM network, only to be discarded there 
because the packet is incomplete. If a packet is trans- 
ferred as a single entity the entire packet may be lost at 
a point of congestion, but further downstream consump- 
tion of network resources is avoided. 
[0004] A node adapted to switch variable-sized
packets may be more complex than a node adapted to
switch fixed-sized packets (cells). However, the cost
resulting from the extra complexity is more than offset
by the increased efficiency gained in the network.
Connections can be established in a network using
variable-sized packet switches, and the connections can be
rate-controlled on either a hop-by-hop or end-to-end basis,
so that sufficient transfer capacity can be reserved to
satisfy service quality requirements. One network
architecture for achieving this is a network that employs the
Universal Transfer Mode (UTM) protocol, described in
detail in United States Patent application No. 09/132,465
to Beshai, filed August 11, 1998. UTM retains many of
the desirable features of ATM but adds more flexibility,
such as the option to mix connection-based and
connectionless traffic, yet significantly simplifies
connection-setup, routing, and flow-control functions. The
simplicity of connection-setup facilitates the use of
adaptive means of admission control. Admission control
is often based on traffic descriptors which are difficult, if
not impossible, to determine with a reasonable degree
of accuracy. An alternative is to use adaptive admission
control that is based on monitoring the network traffic
and requesting a change in transfer allocations on the
basis of both the traffic load and class of service
distinctions. The traffic at each egress module is sorted
according to destination and class of service. The packets
thus sorted are stored in separate logical buffers
(usually sharing the same physical buffer) and the
occupancies of the buffers are used to determine whether
the capacity of a connection should be modified.

[0005] In a network shared by a variety of users,
class of service distinctions are required to regulate
traffic flow across the network. Several traffic types may
share the network, each traffic type specifying its own
performance objectives. Enforcing class of service
distinctions within a switching node adds another dimension
that can potentially complicate the design of the
switch.

[0006] A high capacity switch is commonly constructed
as a multi-stage, usually three-stage, architecture
in which ingress modules communicate with egress
modules through a switch core stage. The transfer of
data from the ingress modules to the egress modules
must be carefully coordinated to prevent contention and
maximize the throughput of the switch. Within the
switch, the control may be distributed or centralized. A
centralized controller must receive traffic state
information from each of the N ingress modules. Each ingress
module reports the volume of waiting traffic destined to
each of N egress modules. The centralized controller
therefore receives N² elements of traffic information
with each update. If, in addition, the controller is made
aware of the class of service distinctions among the
waiting traffic, the number of elements of traffic
information increases accordingly. Increasing the number of
elements of traffic information increases the number of
control variables and results in increasing the
computational effort required to allocate the ingress/egress
capacity and to schedule its usage. It is therefore
desirable to keep the centralized controller unaware of class
of service distinctions while providing a means of taking
the class of service distinctions into account during the
ingress/egress transfer control process.






EP 1 026 856 A2 



[0007] A high capacity ATM switch which uses a
space switch core to interconnect ingress modules to
egress modules is described in US Patent No.
5,475,679 which issued to Munter on December 12,
1995. The controller of the switch coordinates the
transfer of bursts of ATM cells between the ingress modules
and the egress modules. One of the limitations of the
space switch architecture, whether applied to TDM or
ATM, is the necessity to arbitrate among a multiplicity of
ingress/egress connection attempts.

[0008] This limitation can be removed by spatial
disengagement using a rotator-based switch architecture.
In the rotator-based switch architecture, the space
switch core is replaced by a bank of independent
memories that connect to the ingress modules of the switch
through an ingress rotator. Traffic is transferred to the
egress modules of the switch through an egress rotator.
The two rotators are synchronized. A detailed
description of the rotator switch architecture is provided in
United States Patent No. 5,745,486 that issued to
Beshai et al. on April 28, 1998.

[0009] The rotator switch architecture described by 
Beshai et al. works well for fixed length packet protocols
such as asynchronous transfer mode (ATM). It is not 
adapted for use with variable sized packets, however. 
Consequently, there is a need for a switch that can effi- 
ciently transfer variable sized packets. To be commer- 
cially viable, the switch must also be adapted to operate 
in an environment that supports multiple classes of 
service and is rate-regulated to ensure a committed 
quality of service. 

SUMMARY OF THE INVENTION 



[0010] The invention relates to a switch architecture
designed for switching packets of arbitrary and variable
size under rate control from ingress to egress. Two
alternative architectures are described. The first uses a
space-switched core, and the second uses a core that
consists of an array of memories interposed between
two rotators that function in combination like a flexible
space switch. Each of these architectures has been
used for fixed-sized packet applications such as ATM,
as described above. Control methods in accordance
with the invention utilize these known switch
architectures to achieve high-speed switching of variable-sized
packets.

[0011] In order to implement the control methods in
accordance with the invention, the switch apparatus
must include buffers that permit ingress packets to be
sorted by output module and permit packets waiting for
egress to be sorted by ingress module. Preferably the
packets are also sorted by class of service at both the
ingress modules and the egress modules.

[0012] In accordance with a first aspect of the
invention, there is provided a method of reciprocal traffic
control in a switching node for use in a data packet
network, the switching node including N ingress modules
and M egress modules, N and M being integers greater
than one, and a switching core adapted to permit packets
to be transferred from any one of the ingress modules
to any one of the egress modules, wherein the data
packets are sorted into ingress buffers at the ingress
modules so that the packets are arranged in egress
module order, CHARACTERIZED by:

associating a label with each packet to permit an
ingress module at which the packet was received to
be identified prior to sorting in the ingress modules;
and

sorting data packets into egress buffers at the
egress modules using the label to determine a sort
order of the data packets in the egress buffers.


[0013] The packets in each set of buffers are also
preferably sorted by class of service. Class of service
information is hidden from a transfer allocation
mechanism, however. The traffic-load data and the guaranteed
minimum rates determined by a connection-admission-control
process are passed to the transfer allocation
mechanism which computes a transfer schedule for
each ingress/egress pair of modules. An advantage of
containing the class of service differentiation in the
ingress modules is that the switch becomes more scalable
because a computational bottleneck in the central
control is avoided.

[0014] In order to facilitate the transfer of variable-sized
packets through the core, the packets are divided
in the ingress modules into packet segments of equal
size. A last segment of each packet is padded with null
data, if required. Each packet is appended to a header
that contains a label which identifies the ingress module,
identifies whether the packet segment is a last segment
in a packet, and further identifies whether a last
packet segment is a full-length packet segment or a
null-padded packet segment. The packet segments are
sorted in the ingress modules using the labels to
determine a sort order. Consequently, the packets are
ordered for re-assembly in the egress module and are
transferred from the switch in the variable-sized format
in which they were received.
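
The segmentation and re-assembly described above can be sketched as follows. This is a minimal illustration only, not the implementation of the invention: the 128-byte (1024-bit) segment size is the example value from the definitions, while the names `Segment`, `segment_packet`, `reassemble`, the numeric flag encodings, and the explicit `valid_bytes` field are assumptions introduced here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, List, Tuple

SEGMENT_BYTES = 128  # 1024 bits, the example segment size; an assumption here

# Hypothetical encodings of the 2-bit flag carried by each segment.
CONTINUATION, LAST_FULL, LAST_PADDED = 0, 1, 2


@dataclass
class Segment:
    ingress_id: int   # label identifying the ingress module
    flag: int         # CONTINUATION, LAST_FULL, or LAST_PADDED
    payload: bytes    # always SEGMENT_BYTES long
    valid_bytes: int  # number of real data bytes in the payload


def segment_packet(packet: bytes, ingress_id: int) -> List[Segment]:
    """Divide a variable-sized packet into equal-size, labeled segments."""
    if not packet:
        raise ValueError("empty packet")
    segments = []
    for start in range(0, len(packet), SEGMENT_BYTES):
        chunk = packet[start:start + SEGMENT_BYTES]
        is_last = start + SEGMENT_BYTES >= len(packet)
        if not is_last:
            flag = CONTINUATION
        elif len(chunk) == SEGMENT_BYTES:
            flag = LAST_FULL
        else:
            flag = LAST_PADDED
        # Pad a short last segment with null data.
        padded = chunk.ljust(SEGMENT_BYTES, b"\x00")
        segments.append(Segment(ingress_id, flag, padded, len(chunk)))
    return segments


def reassemble(stream: Iterable[Segment]) -> List[Tuple[int, bytes]]:
    """Rebuild packets at egress from segments interleaved across ingress modules.

    Segments from one ingress module are assumed to arrive in the order
    in which they were formed, so sorting by ingress label is sufficient.
    """
    partial = defaultdict(bytearray)  # one logical egress buffer per ingress label
    packets = []
    for seg in stream:
        partial[seg.ingress_id] += seg.payload[:seg.valid_bytes]
        if seg.flag != CONTINUATION:
            packets.append((seg.ingress_id, bytes(partial.pop(seg.ingress_id))))
    return packets
```

For example, a 200-byte packet entering at ingress module 0 yields one continuation segment and one null-padded last segment; interleaving its segments with those of a packet from ingress module 1 and passing the result to `reassemble` returns both packets in their original variable-sized form.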

[0015] The mechanism for selecting segments for
transfer to the core is preferably operable independently
of the conditions at others of the input modules. This
enables the mechanism for selecting segments for
transfer to be simpler, and therefore faster and easier to
scale up.

[0016] The mechanism for selecting segments to
be transferred preferably enables better sharing of core
capacity between input modules, while confining the
additional complexity related to the management of
class of service to the ingress modules.
[0017] In accordance with a further aspect of the
invention, there is provided a switching node for
switching data packets having a plurality of ingress modules
each including a segmentation mechanism for
deconstructing the packets into segments of a predetermined
length at ingress, storing the segments in buffers and
sorting the segments, and a plurality of egress modules,
and a switch core interconnecting the ingress modules
and the egress modules, CHARACTERIZED by:

a selector for selecting which of the buffered seg- 
ments stored in a given one of the ingress modules 
to transfer to the switch core according to the traffic 
class of service property; and 
a packet assembly mechanism for reconstructing 
each packet at egress so that each packet is trans- 
ferred from the switching node in a format in which 
it was received at the ingress module. 

[0018] The capacity of the data packet switch is
shared by sending a committed-capacity matrix from
each ingress module to a transfer allocation mechanism,
each element in the committed-capacity matrix
containing the committed capacity of each ingress module
with respect to each of the egress modules. A matrix
storing a number of traffic units waiting to be transferred
from the ingress module to each of the egress modules
is also sent from each ingress module to the transfer
allocation mechanism. A base matrix is created at the
transfer allocation mechanism, each entry in the base
matrix being a lesser of corresponding entries in the
matrix containing the committed capacity and the matrix
containing the traffic units waiting to be transferred.
Entries in the base matrix are subtracted from
corresponding entries in the matrix containing the traffic units
to create an unassigned traffic matrix. An unused
capacity for each ingress module and each egress module
is computed. The N entries in a diagonal set of the
unassigned traffic matrix are simultaneously processed.
For each ingress/egress pair belonging to a diagonal
set, an additional ingress/egress transfer allocation is
determined on a basis of the least one of an unused
capacity of an ingress module of the ingress/egress
pair, an unused capacity of an egress module of the
ingress/egress pair, and a corresponding
ingress/egress entry in the unassigned traffic matrix. If
the additional ingress/egress transfer allocation is
greater than zero, its value is subtracted from the
unused capacity entry at ingress, the unused capacity
entry at egress, and the ingress/egress entry in the
waiting traffic matrix. These steps are repeated until all
diagonals in the matrix are processed. The process is
repeated each transfer allocation period, and a different
order of diagonal processing is selected for each
transfer allocation period.
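
The allocation steps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the patented mechanism itself: it assumes an equal number N of ingress and egress modules, per-module capacities expressed in segments per transfer allocation period (the parameters `ingress_capacity` and `egress_capacity` are hypothetical), committed capacities that do not exceed those module capacities, and diagonal set d defined as the pairs (i, (i + d) mod N). The function name `allocate_transfers` and the `first_diagonal` parameter are likewise introduced here for illustration.

```python
from typing import List

Matrix = List[List[int]]


def allocate_transfers(committed: Matrix, waiting: Matrix,
                       ingress_capacity: List[int],
                       egress_capacity: List[int],
                       first_diagonal: int = 0) -> Matrix:
    """Return the number of segments each ingress/egress pair may transfer."""
    n = len(waiting)

    # Base matrix: the lesser of committed capacity and waiting traffic.
    base = [[min(committed[i][j], waiting[i][j]) for j in range(n)]
            for i in range(n)]

    # Unassigned traffic: waiting traffic not covered by the base allocation.
    unassigned = [[waiting[i][j] - base[i][j] for j in range(n)]
                  for i in range(n)]

    # Unused capacity of each ingress module and each egress module.
    unused_in = [ingress_capacity[i] - sum(base[i]) for i in range(n)]
    unused_out = [egress_capacity[j] - sum(base[i][j] for i in range(n))
                  for j in range(n)]

    allocation = [row[:] for row in base]

    # Process one diagonal set at a time. The pairs within a diagonal share
    # no ingress or egress module, so this sequential loop gives the same
    # result as processing the N entries simultaneously.
    for step in range(n):
        d = (first_diagonal + step) % n
        for i in range(n):
            j = (i + d) % n
            extra = min(unused_in[i], unused_out[j], unassigned[i][j])
            if extra > 0:
                allocation[i][j] += extra
                unused_in[i] -= extra
                unused_out[j] -= extra
                unassigned[i][j] -= extra
    return allocation
```

Passing a different `first_diagonal` in each transfer allocation period is one simple way, assumed here, of varying the order of diagonal processing so that no ingress/egress pair is systematically favored.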

[0019] This switch resource sharing ensures an
efficient use of switch resources.
[0020] The present invention therefore advantageously
provides a method and an apparatus for switching
variable sized packets at a controlled rate
determined by traffic class of service and destination.
Indeed, the present invention beneficially provides a
rate-controlled, variable-sized packet switch having a
very high capacity. As regards the rate-controlled,
variable-sized packet switch of the preferred embodiment, a
core controller for the switch need not be aware of
class-of-service distinctions.

[0021] In a preferred embodiment of variable-sized
packet switch, variable-sized packets may be segmented
on ingress into fixed size segments with data in
a last segment in each packet being padded with null
data, if necessary. Another embodiment allows for
ingress packet segments to be sorted by egress module
while packet segments waiting for egress may be
sorted by ingress module in order to facilitate the
re-assembly of the variable-sized packets. In a further
embodiment, ingress packets are sorted by both egress
module and class of service, and packets waiting for
egress are sorted by both ingress module and class of
service.

[0022] A preferred embodiment of the present
invention beneficially provides a rate-controlled,
variable-sized packet switch having a central controller which
receives data from the ingress modules and computes
transfer allocations based on the data received.
[0023] In yet another embodiment a controller of a
rate-controlled variable-sized packet switch further
schedules the transfer allocations and supplies each
ingress module with a transfer schedule on a periodic
basis.



BRIEF DESCRIPTION OF THE DRAWINGS

[0024] The invention will now be further explained
by way of example only, and with reference to the
accompanying drawings, in which:

FIG. 1 is a schematic diagram of a three-stage
packet switch showing ingress modules, egress
modules, and an interconnecting core, with the
ingress modules hosting the ingress-to-egress rate
controllers, and the egress module hosting the
egress rate controllers;

FIG. 2 shows the structure of a fixed-length segment
of a packet of an arbitrary length, with an
ingress-port identifier and a two-bit flag to indicate a
last segment of a segmented packet;

FIG. 3 shows a train of segmented packets of variable
size, some of the segmented packets having
last segments that are shorter than a full data segment;

FIG. 4 illustrates the reciprocal ingress and egress
sorting processes, where the packets are segmented
at ingress and sorted according to egress
module, and the segments transferred to the egress
modules are sorted according to ingress module;

FIG. 5 is a schematic diagram of a packet switch
with a space-core and a traffic control mechanism
in accordance with the invention;

FIG. 6 is a schematic diagram of a rotator-based
switch in which each ingress module may independently
transfer its segments to their respective
egress modules;

FIG. 7 shows a more detailed view of a rotator-based
switch core;

FIG. 8 illustrates a process in a rotator switch in
which an ingress module independently writes
packet segments to corresponding vacancies in a
core memory, the number of transferred segments
being limited by predetermined transfer allocations;

FIG. 9 shows a process similar to that shown in
FIG. 8 except that segments may only be written to
a single memory location during a rotator cycle;

FIG. 10 illustrates segment interleaving during normal
operation of a rotator-based switch;

FIG. 11 schematically illustrates a segment aggregation
device based on packet age, size, and class
of service;

FIG. 12 illustrates a two-stage control system
required to effect segment transfer under rate control,
the control system including an ingress/egress
transfer allocation mechanism and an
ingress/egress transfer mechanism;

FIG. 13 illustrates a procedure used by the
ingress/egress transfer allocation mechanism to
determine committed capacity allocations for
ingress/egress module pairs;

FIG. 14 illustrates a traffic-independent procedure
used by an ingress/egress transfer allocation mechanism
to distribute the unused capacity of the
switch core to selected ingress/egress module
pairs;

FIG. 15 illustrates a procedure used by an
ingress/egress transfer allocation mechanism to
distribute the unused capacity of the switch core to
selected ingress/egress module pairs based on
traffic load;

FIG. 16 schematically illustrates a mechanism for
spatial matching in an access scheduling process;

FIG. 17 shows a transfer assignment example;

FIG. 18 is a table indicating the effects of different
ways of partitioning the matching process; and

FIG. 19 illustrates an arrangement for implementation
of a fast matching process.

DEFINITIONS 
[0025] 

(1) Space Switch:

A switch without payload core memory that
connects any of a number of input ports to any of a
number of output ports under control of an addressing
device.

(2) Rotator:

A clock-driven space switch that is much simpler
to control than a space switch and easier to
expand to accommodate a very large number of
ports.

(3) Ingress Module: 

A multiplexer connected to a switch core for 
receiving data from one or more incoming links. 

(4) Egress Module: 

A demultiplexer that receives data from a 
switch core and transfers the data through one or 
more egress links. 

(5) Ingress Link: 

A link that delivers data to an ingress module 
from one or more sources. 

(6) Egress Link: 

A link that delivers data from an egress module 
to one or more sinks. 

(7) Inner Link: 

A link connecting an ingress module to the 
core, or the core to an egress module. 

(8) Rotator Period: 

The time taken for any ingress module to 
access each core memory in a rotator-based 
switch. It is equal to the access interval multiplied 
by the number of ingress modules. 

(9) Section: 

A logical partition in a core memory of a rotator 
switch core. A section is capable of storing at least 
one parcel. 

(10) Segment: 

A data unit of a predetermined size, 1024 bits, 
for example, into which a packet of an arbitrary size 
is divided. A packet may be padded with null data in 
order to form an integer number of segments. 

(11) Parcel: 

An aggregation of a predetermined number of
segments formed at an ingress module, each segment
in the parcel being destined to egress from
the same egress module.

(12) Time Slot: 

The smallest time unit considered in the sched- 
uling process, selected to be the time required to 
transfer a segment across an inner link between an 
ingress module and the switch core, or between the 
switch core and an egress module. 

(13) Access Interval: 

The time interval during which an ingress mod- 
ule accesses a core memory in a rotator-based 
switch core. Also, the time interval during which a 
parcel is transferred from an ingress module to an 
egress module in a space-core switch. 

(14) Segmentation: 

The process of dividing a packet into seg- 
ments. 

(15) Traffic Stream:

One or more segmented packets identified by 
an ingress port, egress port, and optionally a class 
of service distinction. 

(16) Aggregation:

The process of grouping segments having a
common egress module destination into parcels of
predetermined capacity. A parcel is padded with
null data if it is formed with fewer segments than the
predetermined segment capacity of the parcel.

(17) Segmentation Waste:

Switch capacity waste resulting from null padding
required to yield an integer number of segments
from certain packets. Segmentation waste is
expressed as a proportion of null data in segments
transferred during an arbitrary observation period.
A small segment size reduces segmentation waste
but restricts the capacity limit of the switch. Normally,
the segment length is limited by the extent of
fabric parallelism in the switch core.

(18) Aggregation Waste:

The idle time spent in sending null data used to
fill parcels. Aggregation waste is expressed as a
proportion of null segments in parcels transferred
during an arbitrary observation period. Aggregation
waste can be reduced by imposing a parcel-formation
rule that imposes a waiting period before an
incomplete parcel can be transferred.

(19) Capacity:

A network entity (switch, link, path, or connection)
dependent measure. The total bit rate that can
be supported by the referenced entity.

(20) Committed Capacity:

Guaranteed service rate for a connection, usually
expressed as the maximum permissible
number of data units to be transferred in a predefined
time period.

(21) Uncommitted Capacity:

The remaining capacity, if any, of a link after
accounting for all capacity reservations.

(22) Unclaimed Capacity:

The difference, if greater than zero, between
the permissible number of segments to be transferred
in a specified time frame and the number of
segments waiting to be transferred.

(23) Unused Capacity:

Sum of the uncommitted capacity and the
unclaimed capacity.

(24) Diagonal Set:

A set of N ingress/egress module pairs, where
N is the number of ingress or egress modules,
selected such that each ingress module is encountered
exactly once and each egress module is
encountered exactly once in the set. There are N
diagonal sets numbered 0 to N-1, where diagonal-set
0 includes ingress-egress pair (0, 0) and
diagonal-set N-1 includes ingress-egress pair (0, N-1).

(25) Transfer Allocation:

The process of determining the number of segments
to be transferred from an ingress module to
an egress module during a predefined transfer
allocation period. The product of the transfer allocation
process is a matrix that stores the number of segments
to be transferred from each ingress module
to each egress module during a predefined transfer
allocation period.

(26) Transfer Allocation Cycle: 

A sequence of steps performed to determine a 
permissible number of segments to be transferred 
for each ingress/egress pair, during a selected 
transfer allocation period. 

(27) Transfer Allocation Period: 

A period of time allotted to perform the steps of 
a transfer allocation cycle. In a rotator-based 
switch a transfer allocation period is preferably 
selected to be an integer multiple of a rotator 
period. 

(28) Transfer Allocation Efficiency: 

Ratio of the number of segments allocated by a 
transfer allocation device in a transfer allocation 
period and a theoretical number of segments that 
can be allocated during a transfer allocation period 
of the same length by an ideal transfer allocation 
device under the same traffic conditions. 

(29) Service Rate: 

The rate at which traffic is transferred through a
network.

(30) Temporal Matching: 

A process for determining, for an 
ingress/egress pair, time intervals in a predefined 
time-frame during which an ingress module and an 
egress module are available. 

(31) Spatial Matching:

A process for determining available ingress
and egress modules during a given access interval.
In the space-core architecture, the egress module
has two states: busy and available. In the
rotator-based architecture, the states of the egress modules
are perceived by an ingress module during a
given access interval as the number of parcels in
each section of the accessed core memory, or the
number of available parcel slots in each section.

(32) Scheduling: 

The process of specifying the time intervals 
during which allocated segments are to be trans- 
ferred from each ingress module without collision. 
The scheduling process is also called an assignment
process.

(33) Scheduling Cycle: 

A sequence of steps performed during a trans- 
fer allocation period to determine a time table for 
transferring a permissible number of segments for 
each ingress/egress pair. 

(34) Scheduling Period: 

A predefined time period for completing the 
assignment process for all ingress/egress module 
pairs. The scheduling period is preferably equal to 
the transfer allocation period. 

(35) Scheduling Efficiency: 

Ratio between the number of allocated segments
scheduled by the scheduling device during a
scheduling period and the theoretical number of
allocated segments that can be scheduled by an
ideal scheduling device during the same scheduling
period using the same transfer allocation matrix.
(36) Standby Traffic: 

Traffic streams, either connection-based or 
connectionless, with no service guarantees. 
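
As a worked illustration of definitions (17) and (18), the two waste measures can be computed as simple proportions. The sketch below is an assumption-laden example rather than part of the disclosure: the function names, the choice to ignore segment-header overhead, and the sample packet sizes are all introduced here for illustration only.

```python
import math
from typing import Sequence


def segmentation_waste(packet_bits: Sequence[int], segment_bits: int = 1024) -> float:
    """Proportion of null data in the segments needed to carry the packets."""
    segments = sum(math.ceil(bits / segment_bits) for bits in packet_bits)
    carried = segments * segment_bits
    return (carried - sum(packet_bits)) / carried


def aggregation_waste(segments_per_parcel: Sequence[int], parcel_capacity: int) -> float:
    """Proportion of null segments in the parcels transferred."""
    total_slots = len(segments_per_parcel) * parcel_capacity
    return (total_slots - sum(segments_per_parcel)) / total_slots


# Three hypothetical packets of 1024, 1500, and 320 bits.
sizes = [1024, 1500, 320]
waste_1024 = segmentation_waste(sizes, segment_bits=1024)
waste_512 = segmentation_waste(sizes, segment_bits=512)
```

With 1024-bit segments the three sample packets occupy four segments, of which 1252 of the 4096 carried bits are padding; with 512-bit segments the padding falls to 228 of 3072 bits, consistent with the statement that a smaller segment size reduces segmentation waste.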

DETAILED DESCRIPTION OF THE PREFERRED 
EMBODIMENTS 

[0026] This invention is related to switching packets
of variable size. The objective is to switch the packets so
that each packet leaves the switch in the same format in
which it arrived at the switch. Traffic streams flowing
through the switch are rate controlled from ingress to
egress.

[0027] It is well known to transfer variable-sized
data packets through backbone networks by segmenting
the data packets into cells of fixed size at the network
edges. Asynchronous transfer mode (ATM) works
in this way. However, transferring variable-sized data
packets across a network without segmentation into
fixed-size cells simplifies traffic management across the
network and avoids the network link capacity wasted by
partially filled cells and the cell-header overhead.
[0028] In order to realize a switch of very-high
capacity, of the order of tera-bits per second, a
three-stage switch configuration may be used. A three-stage
switch 30 is schematically illustrated in FIG. 1. A first
stage comprises a number N, N > 1, of ingress modules
32 and a third stage comprises a number M, M > 1, of
egress modules 36. A second stage is the core 34 of the
switch 30, which interconnects the ingress modules 32
with the egress modules 36. Normally, the number of
ingress modules 32 equals the number of egress modules
36. In a three-stage switch, the ingress and egress
modules are paired.

[0029] Each ingress module 32 is connected to the
core 34 of the switch 30 by a high-speed inner link 38a.
The ingress module 32 may function as a buffered
multiplexer if it supports several incoming links 40. The
egress module 36 may function as a buffered
demultiplexer if it supports outgoing links 42 of lower speed
than the speed of an inner link 38b from the core 34 of
the switch 30 to the egress module 36. Buffering at
egress is also required to regulate the traffic rate to
various network destinations. An egress module can be
passive and bufferless if none of the outgoing links it
supports has a speed less than the speed of the inner
link 38b from the core 34 to the egress module 36,
provided that rate-regulation is not applied to traffic
streams flowing through the egress module. Each
ingress module 32 must have at least a short ingress
buffer regardless of the ratio of the speed of the
incoming links 40 to the speed of the inner link 38a to the core
34. Incoming packets are buffered at the ingress module
32 because it may not be possible to transfer the
data immediately to the core 34.



[0030] In order to facilitate the management of traffic
within the switch 30, it is common practice to sort
incoming packets. The incoming packets are sorted into
the ingress buffers such that the packets are separated
into traffic streams logically arranged according to the
egress modules 36 from which the respective packets
must egress from the switch 30. This facilitates the
scheduling of traffic as described in United States
Patent No. 5,126,999, which issued to Munter et al. on June
30, 1992. In the case of fixed-size cells, as in ATM
switching, the ingress traffic may be sorted into a
number of queues respectively corresponding to an
egress module. In a switch 30 in accordance with the
invention, variable-sized packets are internally divided
into data units hereafter referenced as packet segments.
The packet segments have a predetermined
length to facilitate transfer within the switch 30.
[0031] FIG. 2 illustrates a preferred structure for 
packet segments 44. Packet segmentation enables 
switch simplicity at the expense of some waste of 
internal switch capacity. The packets are reconstructed 
before they exit the switch in order to avoid capacity 
waste in the links interconnecting the switches. In the 
switch architectures based on a space-switched core or 
a rotator-based core, the segments 44 of each packet 
are received at a respective egress module 36 (FIG. 1) 
in the order in which they were formed on ingress to the 
switch 30. However, the segments of any given packet 
may be interleaved with segments of other packets 
destined to the same egress module 36. To facilitate 
the re-assembly of packets, each segment is preferably 
labeled using a unique identifier 46 associated with the 
ingress module 32 at which the packet entered the 
switch 30. The unique identifier 46 associated with each 
ingress module is preferably a sequential number 
from 0 to N-1, for example, N being the number 
of ingress modules. By sorting the segments 44 at each 
egress module 36 according to the unique identifier 46, 
the segments 44 of any packet are juxtaposed in 
consecutive order in an egress buffer that corresponds to 
the packet's ingress module. The unique identifier 46 is 
stored in a first field of a header associated with each 
packet segment 44. Each packet segment 44 also 
carries a 2-bit flag 48 to identify the position of a segment 
in the packet. The 2-bit flag identifies a segment as 
either a continuation segment, a full-length last 
segment or a last segment with null-padding. In the latter 
case, the number of bytes in the last segment must be 
indicated, preferably by storing the length in the last S 
bits of a payload field, as will be explained below. 

[0032] As shown in FIG. 2, each packet segment 44 
has a two-field header. The first field contains the 
unique identifier 46 of a sufficient number of bits to 
identify the source module. The second field contains the 
2-bit flag 48. If the flag is set to "00", the third field is a 
full payload segment 50, and other segments belonging 
to the same packet follow. If the flag is set to "10", the 
third field is also a full payload segment 51, but the 
segment is the last segment in a packet. If the flag is set 
to "11", the segment is the last segment in a packet and 
the segment is also a partial payload segment. The last 
S bits 52 of the partial payload segment 54 indicate a 
length of the partial payload segment 54, as shown in 
FIG. 2, where S > log2(A), A being the number of 
payload bytes per segment. The length of the payload 
segment 50 is internally standardized, at 128 bytes for 
example. A packet is transferred to the network by the 
egress module only after its last segment, the segment 
with the last segment flag set to "10" or "11", is 
completely received. In order to simplify the scheduling 
process, the packet segments 44 are preferably 
transferred in parcels, with each parcel comprising a 
number of packet segments, four segments for example. 
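
By way of illustration only, the segment format described above may be sketched in Python as follows. The names Segment and segment_packet are hypothetical and do not appear in the specification. The sketch assumes 128 payload bytes per segment and assumes that the length of a partial payload is stored in the last byte of the payload field (S = 8 bits), which is one possible choice satisfying S > log2(A).

```python
from typing import List, NamedTuple

SEGMENT_PAYLOAD = 128  # payload bytes per segment (example value from the text)


class Segment(NamedTuple):
    ingress_id: int  # unique identifier 46 of the ingress module
    flag: str        # 2-bit flag 48: "00", "10" or "11"
    payload: bytes   # always SEGMENT_PAYLOAD bytes long


def segment_packet(packet: bytes, ingress_id: int) -> List[Segment]:
    """Divide a variable-size packet into equal-size segments."""
    if not packet:
        raise ValueError("cannot segment an empty packet")
    segments = []
    for start in range(0, len(packet), SEGMENT_PAYLOAD):
        chunk = packet[start:start + SEGMENT_PAYLOAD]
        is_last = start + SEGMENT_PAYLOAD >= len(packet)
        if not is_last:
            flag = "00"  # full payload, more segments follow
        elif len(chunk) == SEGMENT_PAYLOAD:
            flag = "10"  # full payload, last segment of the packet
        else:
            # Partial payload: null-pad, then store the data length in the
            # last byte (assumed encoding of the "last S bits").
            padding = SEGMENT_PAYLOAD - len(chunk) - 1
            chunk = chunk + b"\x00" * padding + bytes([len(chunk)])
            flag = "11"
        segments.append(Segment(ingress_id, flag, chunk))
    return segments
```

A 300-byte packet yields two "00" segments followed by one "11" segment carrying 44 data bytes, while a 256-byte packet yields a "00" segment followed by a "10" segment.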
[0033] FIG. 3 illustrates the packet segmentation 
process, showing the wasted capacity resulting from the 
partial segment 54 of most packets. The length of a 
packet may be a non-integer multiple of the length of the 
packet segments 44, or an integer multiple of the length 
of the packet segment, as shown in packet 55. 
[0034] FIG. 4 illustrates the reciprocal sorting 
process in accordance with the invention. As each packet 
arrives at the switch 30, the packet 43 is divided into 
packet segments 44 by the ingress module 32, as 
described above. The packet segments 44 are then 
sorted into ingress buffers 56, each of which 
corresponds to an egress module 36 from which the packet 
is to egress from the switch. At each egress module 36, 
the segments are sorted into egress buffers 58 
according to their unique identifier 46 (FIG. 2). Other sorting 
schemes which, for example, are based on additional 
factors such as quality of service (QoS) distinctions may 
also be implemented. 
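
The egress half of the reciprocal sorting process may be sketched as follows, again for illustration only. The function name reassemble is hypothetical. Segments are represented as (ingress identifier, flag, payload) tuples, and the sketch assumes that a partial last segment stores its data length in the last payload byte, which is an assumed encoding rather than one fixed by the specification.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (ingress identifier 46, 2-bit flag 48, payload)
Segment = Tuple[int, str, bytes]


def reassemble(arrivals: Iterable[Segment]) -> List[Tuple[int, bytes]]:
    """Sort interleaved segments by ingress identifier and rebuild packets.

    Returns (ingress_id, packet) pairs in the order in which the packets
    become complete at the egress module.
    """
    egress_buffers: Dict[int, List[bytes]] = defaultdict(list)
    packets = []
    for ingress_id, flag, payload in arrivals:
        if flag == "11":
            # Partial last segment: the last byte holds the data length
            # (assumed encoding), the rest beyond that length is padding.
            payload = payload[:payload[-1]]
        egress_buffers[ingress_id].append(payload)
        if flag in ("10", "11"):
            # Last segment received: the packet may leave the switch.
            packets.append((ingress_id, b"".join(egress_buffers.pop(ingress_id))))
    return packets
```

Because segments from one ingress module arrive in the order in which they were formed, one buffer per ingress identifier is sufficient to restore each packet even when segments from several ingress modules are interleaved.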

Switch Architecture 

[0035] The two switch core architectures, space-core 
and rotator-based, referred to above have many 
characteristics in common. The first uses a space-switch 
stage in the core and is hereafter referenced as 
a "space-core switch". The second uses a core stage 
comprising two rotators and a bank of core memories 
and is hereafter referenced as a "rotator-based switch". 
The rotator-based switch has a simpler control 
mechanism and has a higher capacity upper bound. Using 
either core architecture, a path of dynamically variable 
capacity can be established between each ingress 
module and each egress module. FIG. 5 illustrates the 
space-core architecture. The switch 30 includes a 
space-core 62 that switches packet segments 44 
between the ingress modules 32 and the egress 
modules 36. Each ingress module 32 communicates 
information to a traffic controller 100. The information is 
related to a traffic load and committed capacity (service 
rate) for traffic to be switched to each of the respective 
egress modules 36. The respective committed 
ingress/egress capacities are established by a switch 
controller (not illustrated) that is responsible for traffic 
admission control. Packet segments 44 are transferred 
to the core 62 from the ingress modules 32 via inner 
links 38a and from the core to the egress modules 36 
via inner links 38b. 

[0036] FIG. 6 schematically illustrates a switch 30 
that is constructed using the rotator switch architecture. 
The switch 30 shown in FIG. 6 is quite similar to the 
space-core switch shown in FIG. 5. The space-switched 
core 62 is replaced by a rotator core that consists of two 
rotators 63 (FIG. 7) and a bank of parallel core 
memories 66 that in combination constitute a rotator-based 
core 64. The switch architecture shown in FIG. 6 does 
not require a central traffic controller. However, central 
control can optionally be used to improve utilization of 
the switch 30. 

[0037] FIG. 7 illustrates in more detail the core 64 of 
the rotator-based switch shown in FIG. 6. There are N 
core memories 66, each of which is logically partitioned 
into N sections 68, each section 68 being implicitly 
associated with an egress module and adapted to store 
a predetermined number K, 8 for example, of packet 
segments 44. Each inner link 38a that interconnects an 
ingress module 32 (not shown) to a rotator port (not 
shown) accesses each core memory 66 during a rotator 
cycle. The access duration is at least sufficient to 
transfer the predetermined number of packet segments K to 
the accessed core memory 66. The K packet 
segments 44 need not belong to the same packet 43 or to 
the same egress module 36, i.e., the packet segments 
44 may be transferred to different sections 68 of the 
core memory 66. However, a given packet segment 44 
can only be transferred to a memory section 68 that 
corresponds to the egress module 36 from which the packet 
43, to which the packet segment belongs, is to egress 
from the switch 30. 

[0038] FIG. 8 illustrates the transfer of packet 
segments from an ingress buffer 56 of an ingress module 
32 to an accessed core memory 66 in accordance with 
a first embodiment of the invention. In the example 
shown, at most eight packet segments 44 are permitted 
to be transferred during the access interval (K = 8). The 
squares 70 shown in the core memory 66 represent 
packet segments 44 already transferred by other 
ingress modules 32, and not yet read by the respective 
egress modules 36. The circles 72 shown in the core 
memory 66 indicate the packet segments 44 transferred 
to a given memory section 68 during the illustrated 
access interval. This transfer process is referred to as 
"heterogeneous segment transfer". The computation of 
the transfer allocations will be described below in more 
detail. 

[0039] FIG. 9 shows a simpler implementation, 
referred to as "homogenous segment transfer", of the 
process schematically illustrated in FIG. 8. In 
homogenous segment transfer, segments may only be 
transferred to a single core memory section 68 (i.e., to a 
single egress module 36) during an access interval. 
This implementation is simpler but is less efficient 
because there are occasionally insufficient packet 
segments 44 in any given ingress buffer 56 to fill an entire 
core memory section. Utilization is improved with 
heterogeneous segment transfer, in which the transferred 
segments are written to several memory sections 68. 
The procedure illustrated in FIG. 9 simplifies the 
scheduling process at the expense of a reduced throughput. 
The throughput can be improved by introducing an 
acceptable delay between ingress and transfer to a core 
memory. This will be described below with reference to 
FIG. 11. 

[0040] FIG. 10 schematically illustrates the 
interleaving of packet segments 44 in the core memory of a 
rotator-based switch. The column on the left side (a) of 
FIG. 10 is an illustration of an implementation in which 
each section in a core memory can hold only one 
segment. The right side (b) of FIG. 10 illustrates an 
implementation in which a core-memory section can hold four 
segments. The segments of a packet destined to an 
egress module Y may be transferred to sections 
bearing the same egress-module designation Y but in 
different core memories. The packet segments 44 belong to 
different packets 43 that are respectively to egress from 
the same egress module 36. In the example shown in 
FIG. 10(a), packet segments 44 from four ingress 
modules identified by the numbers 5-8 are interleaved in the 
core memory. Similar interleaving takes place due to 
scheduling conflicts in the architecture of the 
space-core switch 62. In a space-core switch 62, a matching 
process is required between the ingress side and 
egress side of the switch. The problem of matching in 
multi-stage switches has been extensively studied and 
reported in the literature over several decades. It is well 
known that matching consecutive slots is difficult to 
realize and successive segments sent by a given ingress 
module may have to be transferred during non-adjacent 
time slots. Consequently, the packet segments 44 arrive 
at an egress module interleaved with packet segments 
transferred from other ingress modules. The unique 
identifier 46 (FIG. 2) stored in the segment header 
permits the segments to be sorted into egress buffers so 
that the packets may be reassembled in the same 
condition in which they entered the switch 30. The packet 
segment interleaving shown in FIG. 10 therefore does 
not present a problem in a switch in accordance with the 
invention. 



Aggregation 






[0041] The scheduling process can be simplified by 
aggregating segments into parcels, and controlling the 
transfer of parcels to the core. In the space-core switch, 
an ingress module is connected to one egress module 
during an access interval, and a guard time may be 
required between successive access intervals. 
Preferably, therefore, data should be transferred in parcels and 
each parcel should have a sufficient number of 
segments to reduce the relative guard-time overhead. A 
guard time is also needed in the transition between 
successive core memories in a rotator-based switch. In the 
rotator-based switch, the number, K, of segments that 
can be transferred during an access interval is selected 
to be large enough to reduce the relative waste due to 
the idle time during transition between successive core 
memories, and small enough to reduce the rotator-cycle 
duration. The rotator cycle duration is the product of the 
access interval and the number of core memories 66. 
The access interval is determined as D = (K + k) x d, 
where k is the number of idle slots resulting from the 
transition between core memories, and d is a segment 
transfer time. With K = 16, d = 25 nanoseconds, k = 2, 
and N = 128, for example, the access interval D = 0.45 
microseconds, and the rotator period is N x D = 57.6 
microseconds. The K segments transferred in an access 
interval may be destined to egress from several different 
egress modules. This increases the computational effort 
required for scheduling. As explained above, scheduling 
can be simplified by aggregating segments into parcels; 
the number q of packet segments per parcel may be 
greater than 1 and less than or equal to K. In the above 
example, a parcel may have 4 segments, and the 
maximum number of matching egress destinations to be 
found during an access interval is reduced from K = 16 
to m = K/q = 4. 
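
The arithmetic of the above example may be checked with the following short sketch. The function names are hypothetical, and times are kept in integer nanoseconds to avoid rounding.

```python
def access_interval_ns(K: int, k: int, d_ns: int) -> int:
    """Access interval D = (K + k) x d, in nanoseconds."""
    return (K + k) * d_ns


def rotator_period_ns(K: int, k: int, d_ns: int, N: int) -> int:
    """Rotator period: access interval times the number N of core memories."""
    return access_interval_ns(K, k, d_ns) * N


def matches_per_interval(K: int, q: int) -> int:
    """Number m = K / q of egress destinations to match per access interval."""
    if q < 1 or K % q != 0:
        raise ValueError("this sketch assumes that q divides K")
    return K // q


D = access_interval_ns(K=16, k=2, d_ns=25)               # 450 ns = 0.45 microseconds
period = rotator_period_ns(K=16, k=2, d_ns=25, N=128)    # 57600 ns = 57.6 microseconds
m = matches_per_interval(K=16, q=4)                      # 4
```

The sketch makes the trade-off visible: increasing K lowers the relative overhead k / (K + k) of the idle slots but lengthens the rotator period in direct proportion.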

[0042] The packet segments in a parcel may belong 
to different packets but must be destined to egress from 
the same egress module. As discussed above, the traf- 
fic at each ingress module is sorted into logical buffers 
according to the egress-module from which it is des- 
tined to egress. The parcel formation process must 
attempt to increase efficiency by ensuring that most par- 
cels are full. This may be difficult to achieve, however, 
under severe spatial imbalance in traffic load. To cir- 
cumvent this difficulty, parcels are formed from an 
ingress buffer only when the number of segments in the 
buffer equals or exceeds the parcel size, q, or if the wait- 
ing time of the head-of-buffer segment has reached a 
predetermined threshold. The larger the threshold the 
lower the rate of transfer of incomplete parcels and, 
hence, the higher the aggregation efficiency. 
[0043] The aggregation efficiency depends on the 
spatial and temporal distribution of the traffic, which is 
difficult to quantify in real time. The highest aggregation 
efficiency is realized when all the traffic arriving at an 
ingress module is destined to egress from a single 
egress module. A high efficiency may also be realized 
with a reasonable delay threshold if the traffic is spatially 
balanced. Aggregation efficiency decreases 
significantly in a worst case in which a very large proportion of 
the traffic arriving at an ingress module is destined to 
egress from a single egress module and the remainder 
is distributed evenly among all other egress modules. To 
quantify this worst case, the threshold is expressed as 
an integer multiple, T, of an access interval D. During 
the T access intervals, one of the sorted logical buffers 
is almost always ready to form complete parcels while 
each of the remaining N-1 logical buffers contains only 
one segment. At the expiry of the delay threshold, 
N-1 parcels must be formed, each containing only one 
packet segment while the rest of the parcel is padded 
with null segments. The computation of aggregation 
efficiency under these conditions will be described 
below and used to quantify an internal capacity 
expansion to offset the effect of null padding. 

Aggregation Device 



[0044] At ingress, arriving packets are segmented 
as described above and the segments are grouped into 
logical buffers according to egress-module destination. 
Segments may also be further separated into traffic 
streams according to class of service. Segments of the 
same ingress, egress, and class of service 
form a traffic stream. A service rate may be allocated to 
each traffic stream at the ingress module. The sum of 
the service rates allocated to the traffic streams of 
different classes of service, but of the same ingress-egress 
pair, is reported to a central rate-allocation mechanism. 
[0045] If it is decided to aggregate the segments 
into parcels, in order to simplify the scheduling process 
as described earlier, steps should be taken to minimize 
the aggregation waste. A simple rule is to permit parcel 
formation only when there are enough segments to form 
a full parcel, or when a delay threshold is reached. The 
delay threshold may vary according to class of service. 
The size q of a parcel (the maximum number of segments 
[...] 
[0046] [...] aggregating packet segments for a given 
destination belonging to four classes of service. The 
aggregation device comprises a memory 81 for storing a 
delay threshold for each class of service, a memory 83 
for storing a time-stamp for each class of service and a 
memory 85 for storing the number of waiting segments 
per class of service. The segments are stored in a 
memory 87 which is logically partitioned into Q queues, 
Q being the number of classes of service. The delay 
thresholds are determined according to the delay 
tolerance of different classes of service. In the example of 
FIG. 11, with Q = 4, the delay threshold varies from 8 
time units, for a delay-sensitive class of service, to 512 
time units, for a delay-tolerant class of service. The 
respective delay tolerances are stored in the delay 
threshold memory 81. 

[0047] When a segment of class j is generated at an 
ingress module, it is stored in memory 87 in a respective 
j-th logical queue, and the corresponding entry in 
memory 85 is increased by 1. If the j-th entry, 0 ≤ j < Q, in 
memory 85, modulo q, is equal to 1, the entry in memory 
83 is overwritten by the current clock. Thus, an entry in 
memory 83 always indicates the time stamp of the 
segment which determines the eligibility of transfer of 
the segments in the respective segment queue in 
memory 87. The clock time is cyclical. The length of the 
clock cycle is determined according to the largest delay 
threshold. 

[0048] Entries 0 to Q-1 in memory 83 and memory 
85 are read cyclically to determine when waiting 
segments, if any, may be (logically) transferred from 
memory 87 to a ready queue 91, using a round-robin selector 
89. Parcels are formed directly from the ready queue 
91 and a parcel may contain segments of different 
classes. The scanning interval, i.e., the time taken to 
read and process the entries relevant to any of the Q 
classes, is preferably shorter than a segment transfer 
time. During a scanning interval corresponding to class 
j, 0 ≤ j < Q, the following process is performed: 

(1) If the j-th entry in memory 85, storing the 
segment count for class j, equals or exceeds q, q of the 
waiting segments stored in the j-th logical queue in 
memory 87 are logically entered in the ready queue 
91 and the segment count in memory 85 is reduced 
by q. If the remainder is not less than q, the process 
is repeated. Otherwise, the scanner moves to the 
following class. 
(2) If the j-th entry, 0 ≤ j < Q, in memory 83 indicates a 
delay exceeding the threshold for class j, the 
waiting segments in the j-th segment buffer in memory 
87 are logically transferred to the ready queue 91, 
the corresponding entry in memory 85 is reset to 
zero, and the scanner moves to the following class. 
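
The enqueue rule and the two scanning steps may be sketched as follows, for illustration only. The class name AggregationDevice and its method names are hypothetical. The sketch makes three simplifying assumptions that depart from the description: the clock is a monotonically increasing integer rather than a cyclical one, segments are physically moved to the ready queue rather than logically transferred, and step (2) is skipped when no segment of the class is waiting.

```python
from collections import deque
from typing import Any, Sequence


class AggregationDevice:
    """Per-destination aggregation of segments of Q classes of service."""

    def __init__(self, thresholds: Sequence[int], q: int) -> None:
        if q < 2:
            raise ValueError("the modulo-q time-stamp rule needs q > 1")
        self.q = q
        self.thresholds = list(thresholds)                 # memory 81
        self.stamps = [0] * len(self.thresholds)           # memory 83
        self.counts = [0] * len(self.thresholds)           # memory 85
        self.queues = [deque() for _ in self.thresholds]   # memory 87
        self.ready: deque = deque()                        # ready queue 91

    def enqueue(self, j: int, segment: Any, now: int) -> None:
        """Store a new class-j segment and update count and time stamp."""
        self.queues[j].append(segment)
        self.counts[j] += 1
        if self.counts[j] % self.q == 1:
            # First segment of a new (possibly incomplete) parcel.
            self.stamps[j] = now

    def scan(self, j: int, now: int) -> None:
        """One scanning interval for class j."""
        if self.counts[j] >= self.q:
            # Step (1): release complete groups of q segments.
            while self.counts[j] >= self.q:
                self._release(j, self.q)
        elif self.counts[j] > 0 and now - self.stamps[j] > self.thresholds[j]:
            # Step (2): delay threshold exceeded, release all waiting segments.
            self._release(j, self.counts[j])

    def _release(self, j: int, n: int) -> None:
        for _ in range(n):
            self.ready.append(self.queues[j].popleft())
        self.counts[j] -= n
```

With thresholds of 8 and 512 time units and q = 4, five class-0 segments arriving at times 0 to 4 produce one group of four on the next scan, and the fifth segment is released only once it has waited more than 8 time units.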

[0049] It is noted that when a segment is logically 
transferred from memory 87 to the ready queue 91, the 
respective segment-queue size, indicated in memory 
85, is reduced despite the fact that the segments will 
actually be removed from memory 87 upon de-queuing 
from the ready queue 91 at a later time. During an 
access interval, the ready queue 91 transfers to the 
core q segments or all waiting segments, whichever is 
less, via data links 93. 

Transfer Allocation and Traffic Scheduling 

[0050] There are two crucial steps that determine 
the performance and throughput of the switch 30 in 
accordance with the invention. The first is the adaptive 
allocation of ingress/egress path capacity, referred to as 
"transfer allocation". The second is the scheduling of 
transfer times for allocated packet segments to 
be transferred from the ingress modules 32 to the core 
34 of the switch 30, referred to as "transfer assignment". 
[0051] FIG. 12 schematically illustrates the process 
of transfer allocation and transfer assignment. In the 
transfer allocation process, the number of parcels that 
each ingress module 32 is permitted to transfer to each 
egress module 36 during a specified transfer allocation 
period is determined. The length of a transfer allocation 
period is determined based on architectural 
considerations in a manner well known in the art. The transfer 
assignment process is the implementation of the 
transfer allocation process. Two methods for effecting the 
transfer assignment process will be described. In 
accordance with a first method, the transfer assignment 
process is localized in each ingress module. This is 
referred to as the "distributed transfer assignment 
process". This method is suitable for the rotator-based 
switch architecture. Accordingly, an ingress module 32, 
which may have packets to be transferred to several 
different egress modules 36, selects core memory 
sections 68 corresponding to one or more of the egress 
modules without coordination with other ingress 
modules. In accordance with the second method, a central 
controller schedules the data transfer for all 
ingress/egress pairs during each scheduling cycle. This 
is referred to as the "centralized transfer assignment 
process". 

[0052] FIG. 12 illustrates an arrangement in which a 
transfer allocation mechanism 105 receives traffic data 
from the ingress modules 32 via data links 101. The 
transfer allocations are either distributed directly to the 
ingress modules 32 to perform distributed transfer 
assignment, or communicated to a transfer assignment 
mechanism 107. The duration of the transfer allocation 
cycle is switch-architecture dependent. In a space-core 
architecture, the duration of the transfer allocation cycle 
is arbitrary. However, the duration of a transfer 
allocation cycle should be long enough for the computational 
circuitry to complete required computations, yet short 
enough to closely follow variations in traffic volume or 
composition. In the rotator-based architecture, the 
transfer allocation cycle may also be of arbitrary 
duration; however, it is advantageous to set the duration of 
the transfer allocation cycle to an integer multiple of the 
rotator period. The rotator period is the time taken for an 
ingress module to visit each core memory 66 in the core 
64 (FIG. 7). For example, if there are 128 core memories 
66 in core 64, and if the access interval required for an 
ingress module to visit a core memory 66 in core 64 is 
0.50 microseconds, then the rotator period is 0.50 x 128 
= 64 microseconds. If the transfer allocation cycle is 
selected to be an integer multiple of the rotator period, 
the transfer allocation cycle preferably has a duration in 
microseconds of 64, 128, 192, etc. 

Determining the Unused Capacity 



[0053] The admission-control mechanism (not 
shown), whether based on predictive or adaptive 
methods, ensures that the committed rates for the traffic 
streams under its control do not exceed the available 
capacity. Standby traffic may coexist with rate-controlled 
traffic and exploit the fluctuating unused capacity, which 
comprises the uncommitted capacity and the capacity 
unclaimed by the rate-regulated traffic. 
[0054] The transfer allocation process is described 
with reference to FIG. 13, which illustrates an example 
of a switch of four ingress modules 32 and four egress 
modules 36. The type of switch core 34 used between 
the ingress modules and the egress modules is 
irrelevant to this process. The capacity of the inner link 38a 
(FIG. 1) from each ingress module 32 to the core 34 
and the capacity of the inner link 38b from the core 34 to 
each egress module 36 are equal. Each inner link has a 
nominal capacity of 100 packet segments per transfer 
allocation cycle in this example. 
[0055] A matrix 111 stores the number of waiting 
packet segments 44 for each ingress/egress pair. This 
matrix is assembled from data communicated during 
each transfer allocation cycle by the individual ingress 
modules 32 to the transfer allocation mechanism 105 
shown in FIG. 12. The data is communicated through 
data links 101 as described above. Each ingress mod- 
ule communicates a respective row in matrix 111. Array 
113 and array 115 are shown for purposes of illustration 
only and are not necessarily maintained by the transfer 
allocation mechanism 105. Array 113 shows the total 
number of packet segments waiting at each ingress 
module 32, and array 115 shows the number of packet 
segments destined to each egress module 36. 
[0056] Matrix 117, used by the transfer allocation 
mechanism, stores the committed capacity (guaranteed 
minimum capacity) for each ingress/egress pair. The 
committed capacity is the result of one of several admis- 
sion-control mechanisms such as: (1) explicit specifica- 
tions received through the incoming links 40 (FIG. 5) 
feeding the ingress modules 32; (2) an equivalent-rate 
computation based on traffic descriptors; (3) predictive 
methods based on short-term traffic projection; or, (4) 
adaptive observation-based specifications. Each of 
these mechanisms is known in the art. An array 119 
shows a total committed capacity of each ingress mod- 
ule and an array 121 shows the total committed capacity 
for each egress module. Arrays 119 and 121 are shown 
for purposes of illustration only; they are not required by 
the transfer allocation mechanism 105 (FIG. 12). 
[0057] Matrix 123 shows the minimum number of 
packet segments to be transferred per transfer 
allocation cycle by each ingress/egress pair based on initial 
grants by the transfer allocation mechanism 105. Arrays 
125 and 127, shown for purposes of illustration only, 
respectively store the total initial grant for each ingress 
module 32 and each egress module 36. Each entry in 
matrix 123 is the lesser of the values of corresponding 
entries in matrices 111 and 117. For example, there are 
twenty-seven packet segments waiting at ingress 
module 0 that are to egress from the switch 30 at egress 
module 2. The committed capacity for ingress/egress 
pair (0, 2) is 22 according to matrix 117. Initially, in a 
first step, ingress/egress pair (0, 2) is allocated 22 
packet segments per allocation cycle as indicated in 
matrix 123. This allocation may be increased in 
subsequent steps of the allocation process, as will be 
described below. Ingress/egress pair (2, 3) has 12 packet 
segments waiting while the committed capacity is 24 
packet segments. The entry (2, 3) in matrix 123 is 12 
and the extra 12 packet segments are available for 
potential use by other ingress/egress pairs having 
module 2 as the ingress module or module 3 as the egress 
module. Thus, the purpose of constructing matrix 123 is 
to facilitate adaptive capacity sharing. Capacity sharing 
is realized by distributing the unused capacity. Matrix 
152 stores the excess traffic, i.e., the number of waiting 
packet segments that exceeds the committed 
ingress/egress allocations. This excess traffic is treated 
as standby traffic. Arrays 131 and 133, shown for 
illustration purposes only, store the excess (standby) 
traffic aggregated per ingress and egress, respectively. 
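
The construction of the initial grants and of the excess traffic may be sketched as follows, for illustration only. The function names are hypothetical. The small matrices in the usage lines are invented for the example, except that the entries 27 versus 22 and 12 versus 24 reuse the values quoted above for pairs (0, 2) and (2, 3); they are not the contents of FIG. 13.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def initial_grants(waiting: Matrix, committed: Matrix) -> Tuple[Matrix, Matrix]:
    """Return (grant, excess) from the waiting and committed matrices.

    grant[i][j]  = min(waiting[i][j], committed[i][j])   (cf. matrix 123)
    excess[i][j] = waiting[i][j] - grant[i][j]           (cf. matrix 152)
    """
    grant = [[min(w, c) for w, c in zip(w_row, c_row)]
             for w_row, c_row in zip(waiting, committed)]
    excess = [[w - g for w, g in zip(w_row, g_row)]
              for w_row, g_row in zip(waiting, grant)]
    return grant, excess


def unused_capacity(grant: Matrix, link_capacity: int = 100) -> Tuple[List[int], List[int]]:
    """Unused ingress and egress capacity after the initial grants.

    Each entry is the nominal inner-link capacity minus the total initial
    grant of the corresponding ingress row or egress column.
    """
    ingress = [link_capacity - sum(row) for row in grant]
    egress = [link_capacity - sum(col) for col in zip(*grant)]
    return ingress, egress


# Illustrative 2 x 2 example (invented layout, see the note above).
waiting = [[27, 12], [0, 30]]
committed = [[22, 24], [10, 30]]
grant, excess = initial_grants(waiting, committed)
ingress_free, egress_free = unused_capacity(grant)
```

In this example the first pair is granted 22 segments with 5 left over as standby traffic, and the second pair is granted only the 12 segments it actually has waiting, which leaves 12 committed but unclaimed segments available for sharing.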






Allocation of Unused Capacity 

[0058] In order to increase the efficiency of the 
switch and facilitate the transfer of connectionless traffic 
through the network, a switch in accordance with the 
invention allocates unused switch capacity to the 
ingress modules to permit the transfer of waiting packet 
segments 44. This allocation may be accomplished 
using either a traffic-independent distribution method or 
a traffic-dependent distribution method. The 
traffic-independent method is less computationally intensive and 
is therefore easier to implement. The traffic-dependent 
method is more computationally intensive but it is more 
efficient. Each method is described below. 

Traffic-Independent Distribution of Unused Capacity 



[0059] Traffic-independent distribution of unused 
switch capacity is accomplished using a matrix 141 that 
stores the excess capacity of a switch in accordance 
with the invention consisting of four ingress modules 32 
and four egress modules 36. An array 143 stores the 
excess ingress capacity of each ingress module 32. The 
excess capacity is a sum of the uncommitted capacity 
and the unused portion of the committed capacity of 
each ingress module 32. As described above, the 
committed capacity of each link served by an ingress 
module is determined by admission control, which admits 
each connection-based session with a committed 
ingress-to-egress capacity for the connection. The 
unused (excess) capacity for ingress and egress 
modules is stored in arrays 143 and 145, respectively. 
Array 143 is derived from array 125, where each entry 
in array 143 equals 100 minus its counterpart in array 
125. Array 145 is derived from array 127 in a similar 
manner. The nominal inner-link capacity is 100 units in 
this example, as described above. 
[0060] A simple way to distribute the unused 
capacity is to use a "gravity" approach, in which the 
transfer allocation for each ingress/egress pair is 
proportional to a product obtained by a multiplication of the 
excess capacity at ingress (array 143, multiplicand X) 
and the excess capacity at egress (array 145, 
multiplicand Y). The product is divided by the total excess 
capacity. The total excess capacity 149 is obtained by 
summing either the excess ingress capacity (array 143) 
or the excess egress capacity (array 145). 

[0061] This process is traffic-independent and 
requires N^2 computations, each computation being a 
multiplication followed by a division. The N^2 division 
operations can be reduced to N division operations by 
initially dividing one of the multiplicands, X (array 143) 
or Y (array 145), by the total excess capacity 149. In 
doing so, one of the multiplicands is modified by 
left-shifting by B bits, B being an integer greater than 8, 
10 bits for example (equivalent to multiplying by 1024), 
before the division. The result of pair-wise multiplication 
of elements of array 145 and the modified array 143 (or 
elements of array 143 and a modified array 145) is then 
right-shifted by the same B bits, 10 bits in this example 
(equivalent to division by 1024), with rounding. The 
result is the number of packet segments that is allocated 
for transfer to each ingress/egress pair. The above 
process can use as many parallel multiplication units as 
practically feasible, since the N^2 multiplications are 
independent of each other. The sum of the entries in a 
row in matrix 141 may be less than a respective entry in 
array 143 due to rounding. A column sum in matrix 141 
may differ from a respective entry in array 145 for the 
same reason. 



Traffic-Dependent Allocation of Unused Capacity 



[0062] As shown in FIG. 15, matrix 152 (D), obtained 
by matrix subtraction of matrix 111 minus matrix 123, 
stores the waiting traffic in excess of the committed 
capacity for each ingress-egress pair. As described 
above, this excess traffic is treated as standby traffic. 
The traffic is expressed in terms of a number of parcels 
per allocation period. Array 143 stores the total unused 
capacity for each ingress module 32 and array 145 
stores the total unused capacity for each egress module 
36. Arrays 143 and 145 are required to implement the 
capacity-sharing procedure described below. 
[0063] A non-zero entry in matrix 152 (D) indicates 
waiting traffic in excess of the committed capacity for 
the entry. The number of non-zero entries in matrix 
152 (D) can be as high as the number of ingress/egress 
pairs. In general, it is desirable to fully utilize the 
ingress/egress paths. In the network context, rate 
regulation is actually only relevant at the egress module; 
ingress/egress rate regulation is strictly an internal 
switch-design issue. It is desirable to allocate the 
excess ingress capacity and excess egress capacity, as 
indicated in arrays 143 and 145, 
to enable the transfer of excess traffic to the egress 
modules. With 128 ingress modules 32 and 128 egress 
modules 36, for example, up to 16384 entries may need 
to be processed during an allocation period (of 100 
microseconds, for example). Each entry may be 
examined to determine if the indicated number of packet 
segments 44 can be fully or partially transferred during the 
subsequent transfer allocation period, based on the 
current uncommitted capacity of the inner links 38a,b to the 
core 34 of the switch 30. The process must be 
completed within a transfer allocation period of a reasonable 
duration, and parallel processing is required. To avoid 
transfer allocation collisions, where two or more 
ingress/egress pairs may be allocated the same unused 
capacity at a given egress link, the parallel process is 
preferably implemented using a moving-diagonal 
method. In the moving-diagonal method, 
ingress-egress pairs of a diagonal set (as defined earlier) are 
processed simultaneously. Within a diagonal set, each 
ingress module 32 is considered only once and each 
egress module 36 is considered only once in the 
parallel computations for the transfer allocation process, and 
no conflicting transfer allocations occur. After 
processing each diagonal set, the unused capacity of each 
ingress and egress module is updated to account for the 
diagonal-set allocation result. 

[0064] If the diagonal-set pattern were repeated in 
each scheduling cycle, the ingress/egress pairs of the 
diagonal-sets considered first would get a better share 
of the unused capacity than the ingress/egress pairs of 
the diagonal-sets considered later in the cycle. To avoid 
potential unfairness, the starting diagonal is shifted after 
each N diagonal transfer allocations, N being the
number of ingress (or egress) modules. 
[0065] The diagonal transfer allocation procedure 
will now be explained with reference to FIG. 15. A diag- 
onal is defined by its first element. For example the 
diagonal from (0, 0) to (N-1, N-1) is labeled as diagonal 
(0, 0). In this example, the starting diagonal is selected 
to be (0, 0), as shown in matrix 152(E) of FIG. 15. Entry 
(3, 3) in matrix 152(D) contains a number representative 
of six packet segments 44. Ingress module 3 has an 
unused capacity (array 143(D)) of five packet segments,
while egress module 3 (array 145(D)) has an unused 
capacity of thirty-four packet segments. The maximum 
number of packet segments that may be allocated is 
then the minimum of 6, 5 and 34. The allocated value of 
5 is subtracted from entry (3, 3) in matrix 152(E), entry 
(3) in array 143(E) and entry (3) in array 145(E), yielding 
the corresponding result in matrix 152(E), array 143(E), 
and array 145(E) of 1, 0, and 29, respectively.
[0066] The second diagonal is selected to be (0, 1) 
as shown in matrix 152(F) of FIG. 15. The entry (0, 1) in 
matrix 152(D) has four packet segments waiting for 
transfer, and the unused capacity in the corresponding 
ingress and egress modules are thirty-four and thirty-six 
packet segments, respectively. The four packet seg- 
ments are therefore allocated, and the corresponding 
unused capacity in ingress 0 and egress 1 are reduced 
to thirty and thirty-two packet segments, respectively, as
shown in arrays 143(F) and 145(F). Entry (1, 2) has 
twenty packet segments waiting for transfer, while 
ingress 1 has thirty-seven uncommitted packet seg- 
ments and egress 2 has fifteen uncommitted packet 
segments. Thus, only fifteen out of the twenty packet
segments are allocated, and the uncommitted capacities
in ingress 1 and egress 2 are reduced to twenty-two
and zero, respectively. 

[0067] This process continues until any of the following
conditions is met: (1) all diagonals are visited as
shown in matrices 152(E) to 152(H); (2) all the waiting
traffic is served; or (3) the excess capacity is
exhausted. In the following transfer allocation period,
the procedure preferably starts at a diagonal other than
the diagonal (0, 0) to (3, 3) in order to facilitate a fair
spatial distribution of the transfer allocations of the
waiting traffic. For example, the order may be {(0, 0), (0, 1),
(0, 2), (0, 3)} in one cycle and {(0, 2), (0, 1), (0, 3), (0, 0)}
in the following cycle.
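
By way of illustration only, the moving-diagonal procedure
may be sketched in Python as follows. The function name, the
simple rotation of the starting diagonal, and every matrix or
array value not quoted in the text above are hypothetical;
FIG. 15 is not reproduced here. Only entries (3, 3), (0, 1)
and (1, 2) and their associated unused capacities are taken
from the description, and all other demands are set to zero
so that the stated results can be checked.

```python
def moving_diagonal_allocate(demand, ingress_free, egress_free, start=0):
    """Allocate unused capacity to standby traffic, one diagonal set at a time.

    demand[i][j]    : waiting segments from ingress i to egress j (cf. matrix 152)
    ingress_free[i] : unused capacity of ingress module i (cf. array 143)
    egress_free[j]  : unused capacity of egress module j (cf. array 145)
    start           : index of the first diagonal visited in this period
    """
    n = len(demand)
    demand = [row[:] for row in demand]           # work on copies
    ingress_free = list(ingress_free)
    egress_free = list(egress_free)
    granted = [[0] * n for _ in range(n)]

    for step in range(n):                         # condition (1): all diagonals visited
        d = (start + step) % n
        # Pairs (i, (i + d) mod n) share no ingress or egress module, so a
        # hardware implementation could process them simultaneously.
        for i in range(n):
            j = (i + d) % n
            g = min(demand[i][j], ingress_free[i], egress_free[j])
            granted[i][j] = g
            demand[i][j] -= g
            ingress_free[i] -= g
            egress_free[j] -= g
        waiting = any(any(row) for row in demand)
        if not waiting or not any(ingress_free) or not any(egress_free):
            break                                 # conditions (2) and (3)
    return granted, demand, ingress_free, egress_free


# Entries quoted in the text; zeros and the capacities 10 and 20 are placeholders.
demand = [[0, 4, 0, 0],
          [0, 0, 20, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 6]]
granted, left, ing, egr = moving_diagonal_allocate(
    demand, [34, 37, 10, 5], [20, 36, 15, 34])
```

Passing a different value of start in each transfer
allocation period is one simple way of varying the order in
which the diagonals are visited.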


Parcel Assignment 

[0068] The ingress/egress capacity is modified
every transfer allocation cycle according to the varying
traffic loads using the mechanism described above.
However, it may not be possible to fully utilize the
allocated capacities due to scheduling conflicts resulting
from the inherent difficulty in fitting the transfer
allocations in a time calendar. An efficient scheduling
procedure is therefore required to effect a high utilization of
the allocated capacity. If the scheduling process is
designed to achieve a high efficiency, a small expansion
of internal switch speed with respect to combined
ingress link speeds would be required to offset the effect
of mismatching. Some additional expansion is also
required to compensate for the waste that results from
the padding probably required in the last packet segment
44 of most packets, and if the packet segments
are aggregated into parcels, then the aggregation waste
must also be taken into account in the calculation of the
required overall expansion.

[0069] As mentioned above, either of two parcel
assignment methods may be used. The first method
determines the assignment in a distributed manner in
which each ingress module selects, within a single
access interval, the core memory sections available to
accept parcels waiting for transfer. The second method
is based on looking ahead, scheduling the assignment
over a predefined scheduling cycle of several access
intervals using a centralized scheduler.

[0070] The procedures described above for transfer 
allocation and transfer assignment have been described 
with reference to segments as the data units transferred 
from the ingress modules 32 through the switch core 34 
to the egress modules 36. Those procedures also apply
equally to the case in which the data units transferred 
are parcels. 

Distributed Assignment 


[0071] The distributed assignment method is used 
herein only for the rotator-based architecture. In the 
rotator-based architecture, the transfer of parcels to the
with reference to FIG. 17.
the maximum number, m, of parcels that can be transferred
during an access interval.
[0079] When m > 1, i.e., when more than one parcel
may be transferred during an access interval, as in the 
case of a rotator-based core, the matching process may 
be implemented as follows: 

(1) At any instant, N transfer allocation memories 
and N egress-state memories in a diagonal-set are 
paired. Corresponding entries of paired memories 
are compared to determine the number of parcels 
that may be transferred from each ingress module 
to each egress module. At most m parcels may be 
transferred during an access interval; 

(2) Subtract the outcome from the respective 
entries in transfer allocation memories and the 
egress-state memories; 

(3) Perform steps (1) and (2) for all diagonal-sets; 

(4) Repeat steps (1), (2), and (3) using a different 
order of diagonal-sets. 
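
By way of illustration only, steps (1) to (3) may be
sketched in Python as follows. The function and variable
names and the two-module sample data are hypothetical. The
sketch assumes that, in diagonal-set j, ingress module x is
paired with the egress-state memory of access interval
(x + j) modulo N, and that entries are scanned in ascending
egress order; step (4) is not modeled.

```python
def match_cycle(allocation, egress_state, m, order=None):
    """One pass over all diagonal sets.

    allocation[x][k]   : parcels ingress x may still send to egress k
    egress_state[s][k] : vacancy for egress k during access interval s
    m                  : maximum parcels per ingress module per access interval
    Returns (schedule, remaining allocation, remaining vacancy), where
    schedule[x][s] lists (egress, parcels) chosen for ingress x in interval s.
    """
    n = len(allocation)
    allocation = [row[:] for row in allocation]       # work on copies
    egress_state = [row[:] for row in egress_state]
    schedule = [[[] for _ in range(n)] for _ in range(n)]

    for j in (range(n) if order is None else order):  # step (3): every diagonal set
        for x in range(n):                            # memory pairs of diagonal set j
            s = (x + j) % n
            budget = m
            for k in range(n):                        # step (1): compare entries
                take = min(allocation[x][k], egress_state[s][k], budget)
                if take:
                    allocation[x][k] -= take          # step (2): subtract outcome
                    egress_state[s][k] -= take
                    budget -= take
                    schedule[x][s].append((k, take))
    return schedule, allocation, egress_state


# Hypothetical data: N = 2 modules, at most m = 2 parcels per access interval.
schedule, remaining, vacancy = match_cycle(
    [[3, 1], [0, 2]], [[1, 1], [2, 2]], m=2)
```

Because the pairs of a diagonal set touch disjoint memories,
the inner loop over x could run in parallel in hardware; it is
written sequentially here only for clarity.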

[0080] As mentioned above, although the transfer 
allocation process ensures that the content of the trans- 
fer allocation memories can be accommodated during a 
scheduling interval, it is possible to obtain a non-zero 
remainder after all of the diagonal sets are processed in 
the scheduling cycle. This may result from imperfect 
scheduling. Near-perfect scheduling requires relatively 
extensive computational effort, and consequently a relatively
long scheduling period. However, to ensure that
the scheduling process closely follows traffic variations, 
it is desirable to keep the scheduling period short, dic- 
tating a relatively simple scheduling method. A simple 
way to compensate for scheduling mismatch loss is to 
provide a reasonable additional internal expansion in 
transfer speed between the inner and outer sides of 
each ingress module 32 and, similarly, between the 
inner and outer sides of each egress module 36. 

Centralized Matching Device 



[0081] The centralized matching device, FIG. 16, 
includes a bank of N egress-state memories 164, a 
bank of N transfer allocation memories 166, N matching 
circuits 168, and N schedule-storage memories 172. A 
transfer allocation memory 166, a matching circuit 168,
and a schedule-storage memory 172 are associated
with each ingress module 32. The matching of a transfer
allocation array of an ingress module and the N egress-state
arrays over a scheduling cycle is enabled using a
rotator 162 as shown in FIG. 16.
[0082] Each egress-state memory 164 has N
entries, each entry representing an occupancy state of
an egress module during an access interval. In the case
of the space-core architecture, the state of an egress
module is represented by one bit because only one parcel
can be transferred to an egress module during an
access interval (the parcel may contain several packet
segments). In the case of a rotator-based switch
architecture, the egress-module states during an access
interval are represented by the vacancy of each of the N
sections 68 of the core memory 66 (FIG. 7) to be
accessed. The vacancy of a section is the number of
parcels that can be accommodated by the section.
[0083] Each transfer allocation memory 166 has N
entries (not shown), and each entry holds the number of
parcels to be transferred from the ingress module 32 to
each of the egress modules 36 within the scheduling
period. The transfer allocations are updated during the
scheduling process as will be described below.

[0084] The matching circuit 168 compares the
transfer allocations of an ingress module 32 with the
egress-states as read from the egress-state memory
164 to which the ingress module is connected during an
access interval. The comparison yields a number,
possibly zero, of parcels that can be transferred to the core
memory 66 accessed during an access interval and the
egress-module designation of each parcel. At most m
parcels can be selected, m being equal to 1 in the case
of a space-core switch, and up to K in the case of a
rotator-based core, K being the maximum number of
segments a core-memory section can hold. If m > 1, the m
parcels may be destined to a number, j, of egress
modules, 0 ≤ j ≤ m. Each of the j parcels is identified by its
egress-module designation. The number j (which may
be zero) is written in the schedule-storage memory 172
associated with the respective ingress module, followed
by the egress module of each of the j selected parcels
(if j > 0). The contents of the N schedule-storage
memories 172 are pipelined to the respective ingress
modules 32 and each ingress module forms a
parcel-transfer schedule to follow in the subsequent scheduling
period.

[0085] FIG. 17 shows an example of matching
results for the case of eight parcels per access-interval
(m = 8) in a rotator-based switch. In the example,
ingress-module 0 sends a total of 8 parcels to
egress-modules 0, 2, 7, and 9, while ingress-module N-1 sends
a total of 6 parcels to egress-modules 5 and 8, with two
idle parcel-slots. Ingress-module 0 sends one parcel to
egress-module 0, four parcels to egress-module 2, two
parcels to egress-module 7, and one parcel to
egress-module 9.



Speeding-Up the Matching Process

[0086] The matching process takes place simultaneously
for all the memory pairs of a diagonal set. During
a scheduling cycle, all diagonal sets are processed.
The number of parcels that may be transferred from an
ingress module to each egress module for the entire
scheduling cycle is determined during the allocation
process described above.
[0087] The matching process determines the
number of parcels that can be transferred from an
ingress module x to selected egress modules during a
given access interval. The matching process requires
that up to N comparisons of two arrays be performed to
select the parcels that can be transferred from an
ingress module during an access interval. The
number of parcels assigned for transfer is bounded by
the predetermined maximum number m of parcels that
may be transferred in an access interval.
[0088] In order to speed up the matching process
described above, each transfer-allocation array 166
(FIG. 16) of N entries may be subdivided into a number
of sub-arrays, each having a smaller number of entries
(FIG. 19). Each egress-state array 164 is likewise
subdivided, so that there is a one-to-one correspondence
between the entries of a transfer-allocation sub-array
and the entries of an egress-state sub-array. This
permits parallel matching of the entries of the respective
sub-array pairs and hence reduces the time required to
complete the matching process.
[0089] To realize parallel matching of the
sub-arrays, the N entries in each memory are placed in a
number of separate sub-array memories so that several
sub-array matching processes can take place
simultaneously and the outcomes of the sub-array matching
processes are subsequently examined to select the
parcels to be transferred.
[0090] The matching process may be implemented
in a multi-stage parallel process. To illustrate the benefit
of multi-stage processing, consider the case where N =
256 and m = 16. A single-stage search for matching
slots requires up to N comparisons followed by up to m
additions, where m is the maximum number of parcel
transfers per access interval. Even though, in most
cases, the number of comparisons is likely to be much
less than N, the design of the matching device should
be based on N comparisons. The time required is then
determined by the time taken to execute 256
comparisons and up to 16 additions.

[0091] If each of the arrays to be matched is divided
into 16 sub-arrays of 16 entries each, then 16 matching
processes, each involving 16 entries, may be carried
out simultaneously. Each of the matching processes
yields 0 to 16 matching slots, and a round-robin
selection process requiring up to 16 additions determines the
number of parcels to be assigned. The time required is
then determined by the time taken to execute 16
comparisons and up to 16 additions.
[0092] The matching time may be further reduced
by using more stages. If n is the number of entries in the
sub-array (16 in the above example), and g is the
number of stages, such that $n^g = N$, then the number
of comparisons is n × g rather than $n^g$. FIG. 18 tabulates
the number of operations for g = 1, 2, 4, and 8. It is seen
that the duration of the matching process, expressed as
multiples of the duration of a single operation,
decreases from 256, when g = 1, to 16 when g = 4.
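
By way of illustration only, the operation counts for
N = 256 may be reproduced with the following Python sketch.
The function name is hypothetical, and FIG. 18 itself is not
reproduced; the sketch merely evaluates n × g for the values
of g named above, with n chosen so that n to the power g
equals N.

```python
def matching_operations(N, g):
    """Return (n, n * g), where n is the sub-array size satisfying n**g == N."""
    n = 1
    while n ** g < N:          # integer search avoids floating-point roots
        n += 1
    if n ** g != N:
        raise ValueError("N is not a perfect g-th power")
    return n, n * g


# Sub-array size and operation count for each number of stages.
table = {g: matching_operations(256, g) for g in (1, 2, 4, 8)}
```

For g = 2 the sketch gives n = 16 and 32 operations, which
agrees with the 16 comparisons and up to 16 additions of the
two-stage example above.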



[0093] FIG. 19 is a schematic of a device 280 for
partitioning the diagonal-set matching process. In FIG.
19, circuit 218 includes a selector and an adder. Also,
circuit 224 includes a selector and an adder. In this
example, the value of N is 128. Each of the matching
arrays, two in this example, is divided into 16 sub-arrays
212 and 214; each sub-array has 8 entries, and each
matching circuit 216 compares the respective entries of
each pair of sub-arrays 212, 214. Each of the circuits
218 adds up the results of four matching circuits, and
enters the lesser of the sum and the predetermined limit
into a register 222. A third-stage circuit 224 adds up the
results in registers 222 and enters the lesser of the sum
and the predetermined limit in register 226. Each
matching circuit 216 stores the matching result in a
memory with up to m entries, and the addresses of the
selected parcels in the data-memory storage are
determined in a manner well known in the art.
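
By way of illustration only, the three-stage counting of
FIG. 19 may be sketched in Python as follows. The function
name and the sample data are hypothetical. The sketch assumes
that the match for one entry is the lesser of the allocation
and the vacancy, and that the first stage is also capped at
the predetermined limit; neither assumption is stated
explicitly in the description.

```python
def partitioned_match_count(allocation, vacancy, limit, sub_size=8, fan_in=4):
    """Count assignable parcels using sub-array matching and capped sums."""
    # Stage 1 (cf. circuits 216): match each pair of sub-arrays independently.
    stage1 = []
    for start in range(0, len(allocation), sub_size):
        pairs = zip(allocation[start:start + sub_size],
                    vacancy[start:start + sub_size])
        stage1.append(min(limit, sum(min(a, v) for a, v in pairs)))
    # Stage 2 (cf. circuits 218, registers 222): add groups of fan_in results
    # and keep the lesser of the sum and the limit.
    stage2 = [min(limit, sum(stage1[i:i + fan_in]))
              for i in range(0, len(stage1), fan_in)]
    # Stage 3 (cf. circuit 224, register 226): final capped sum.
    return min(limit, sum(stage2))


N, m = 128, 16
allocation = [2] * N                       # hypothetical transfer allocations
vacancy = [0] * N                          # hypothetical egress vacancies
vacancy[3], vacancy[40], vacancy[100] = 1, 5, 3
few = partitioned_match_count(allocation, vacancy, m)
many = partitioned_match_count(allocation, [1] * N, m)
```

Since every partial sum is non-negative, capping at each
stage yields the same final count as capping once at the end;
the intermediate caps only bound the register widths.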
[0094] Partitioning and parallel processing, as
depicted in FIG. 19, may be used with both the
distributed and centralized assignment methods.



Assignment Example 

[0095] FIG. 17 illustrates the matching process for a
space-core or rotator-based switch. The figure relates
to a switch with N = 12 and m = 8. (Each array 184
stores the vacancy of the N egress modules during an
access interval j in the succeeding scheduling cycle.)
Each array 182 stores the transfer allocations of a
respective ingress module. The transfer allocations are
expressed as a number of parcels that may be
transferred to each egress module during the entire
scheduling cycle. The transfer allocations are reduced after
each assignment. In an ideal assignment process, the
transfer allocations for each ingress module should
reduce to zero at the end of each scheduling cycle.
However, the assignment process is likely to leave a
remainder of unassigned parcels. An internal expansion
of the switch, i.e., a ratio of the inner transfer speed to
the outer transfer speed greater than unity, can be used
to reduce the remainder of unassigned parcels to
insignificant levels. Unassigned parcels remain in the
ingress queues to be served in subsequent scheduling
cycles. Arrays 182 and 184 are stored in separate
memories in order to facilitate parallel matching as described
above.

[0096] The matching process is implemented in N
steps, where each step corresponds to an access
interval and comprises matching of a diagonal set. A
diagonal-set comprises N ordered pairs of arrays 182 and
184, e.g., pairs (0, j) to (N-1, {N-1+j} modulo N), 0 ≤ j < N.

[0097] During access interval j, 0 ≤ j < N, the
transfer allocation array 182 for ingress module 0 is paired
with the egress state array 184 of access interval j, the
transfer allocation array 182 for ingress module 1 is
paired with the egress state array 184 of access interval
j + 1, and so on. The example of FIG. 17 illustrates the
matching process for ingress modules 0 and 7. Transfer
allocation array 182 for ingress module 0 is paired with
the egress state array 184 of access interval 9 and the
transfer allocation array 182 for ingress module 7 is
paired with the egress state array 184 for access
interval 4 ({7 + 9} modulo 12).
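
By way of illustration only, the pairing rule may be
checked with a short Python sketch. The function name is
hypothetical; N = 12 and the diagonal-set index 9 are taken
from the example above.

```python
def paired_interval(ingress, j, n):
    """Access interval whose egress-state array is paired with the given
    ingress module in diagonal-set j."""
    return (ingress + j) % n


N = 12
# Ingress modules 0 and 7 in diagonal-set 9.
pairs = {x: paired_interval(x, 9, N) for x in (0, 7)}
# Every egress-state array is used exactly once within a diagonal set.
used_once = sorted(paired_interval(x, 9, N) for x in range(N)) == list(range(N))
```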

[0098] The outcomes of the matching process for
ingress modules 0 and 7 in this example are as shown
in respective arrays 186. Ingress module 0 has 17
parcels allocated for transfer to egress module 0, but the
corresponding egress state array 184 indicates only one
available parcel slot. Hence, only one parcel is assigned
during access interval 9. The transfer allocation for
egress module 2 is 14, but there are only 4 vacant
parcel slots, hence 4 parcels are assigned. Similarly, two
parcels are assigned to egress module 7. The allocation
for egress module 9 is 7 parcels while the egress state
of egress module 9 indicates 8 vacant parcel slots. The
7 parcels cannot, however, be assigned because the
total number of parcels that can be transferred from an
ingress module during an access interval is only 8.
Hence, only one parcel is assigned for transfer to
egress module 9. The matching process need not follow
a sequential order as illustrated in this example, where
the selected matching transfers are interleaved by other
matching opportunities such as the transfer to egress
module 1.
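
By way of illustration only, the assignment for ingress
module 0 may be reproduced with the following Python sketch.
The function name is hypothetical. Only the values quoted
above are taken from the description; the allocation and
vacancy for egress module 7 are placeholders chosen to give
the stated two parcels, all other entries are set to zero,
and the matching opportunity at egress module 1 is therefore
not modeled. The sketch scans the entries in ascending egress
order, which is one possible order among others.

```python
def match_pair(allocation, vacancy, m):
    """Match one transfer-allocation array against one egress-state array.

    Returns (assigned, total), where assigned lists (egress, parcels) and
    total never exceeds m, the per-interval limit for an ingress module.
    """
    assigned = []
    budget = m
    for egress, (want, room) in enumerate(zip(allocation, vacancy)):
        take = min(want, room, budget)
        if take > 0:
            assigned.append((egress, take))
            budget -= take
    return assigned, m - budget


# Array 182 for ingress module 0 and array 184 for access interval 9 (N = 12).
allocation_0 = [17, 0, 14, 0, 0, 0, 0, 5, 0, 7, 0, 0]   # entry 7 is a placeholder
vacancy_9 = [1, 0, 4, 0, 0, 0, 0, 2, 0, 8, 0, 0]        # entry 7 is a placeholder
assigned, total = match_pair(allocation_0, vacancy_9, m=8)
```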

[0099] The assignment for ingress module 7 yields
only 6 parcels, indicated by respective total 188, due to
the mismatch of allocations and vacancies as illustrated
in FIG. 17. Array 184 indicates vacancies for entries
corresponding to egress modules 0, 1, 4, 9, 10, and 11, but
the corresponding entries in array 182 indicate that no
parcels are allocated for these egress modules. Array
182 is stored in memory 166 (FIG. 16) and array 184 is
stored in memory 164 (FIG. 16). The resulting array 186
is stored in memory 172 (FIG. 16).
[0100] The outcomes of the diagonal-set matching
processes are stored in N arrays 186, each of which
shows the number of assigned parcels 188 followed by
the egress-identity of each parcel. Arrays 186 are used
to decrease the respective entries in arrays 182 and
184. Arrays 186 are transferred to the respective
ingress modules for use in the succeeding scheduling
cycle.
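
By way of illustration only, the layout of an array 186
(the number of assigned parcels followed by the egress
identity of each parcel) may be sketched in Python as
follows. The function name and the list representation are
hypothetical; the sample input is the assignment described
above for ingress module 0.

```python
def to_schedule_record(assignments):
    """Flatten (egress, parcels) pairs into [total, egress id of each parcel]."""
    egress_ids = [egress for egress, count in assignments for _ in range(count)]
    return [len(egress_ids)] + egress_ids


# One parcel to egress 0, four to egress 2, two to egress 7, one to egress 9.
record = to_schedule_record([(0, 1), (2, 4), (7, 2), (9, 1)])
```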

[0101] FIG. 16 shows a two-way rotator 162 which 
includes a forward rotator and a backward rotator. The 
forward rotator transfers the contents of memories 164 
to the matching circuit 168. The backward rotator
transfers a matching result back to the memories 164 to
update the egress-state. 

[0102] The assignment process described above
applies to both the space-core and rotator-based
switches. In the rotator-based switch, the egress states
are represented by the occupancy of the core memories
66. Consequently, corresponding entries in the
egress-state arrays 164 used to create a diagonal set used in
the assignment process must be restored to full
vacancy (a vacancy of 8 parcels each in the example of
FIG. 17) to account for the transfer of parcels from the
core memories 66 to the egress modules 36. If writing
precedes reading in a core memory, then at the end of
the matching process for diagonal-set j, entry {X + j}
modulo N in the egress-state array 164 paired with each
ingress module X, 0 ≤ X < N and 0 ≤ j < N, is reset to 8.
If reading precedes writing, the entries are reset at the
start of the diagonal-set matching.
[0103] The interpretation of the egress-state arrays
184 for space-core switches differs from that of the rota- 
tor-based switch, though this differentiation has no 
bearing on the procedures described above. In the 
space-core switch, an array 184 contains the states of 
the N egress modules during an access interval. In the 
rotator-based switch, an array 184 contains the states of 
the N sections 68 of a core memory 66. The sections 68 
of a core memory 66 have one-to-one correspondence 
with the N egress modules 36. Because a core memory 
66 stores a parcel for a deterministic number of access 
intervals before transferring it to the respective egress 
module, it is necessary to restore the vacancy of one 
entry in each array 184 with each diagonal-set matching 
process as described above. 

Transfer Allocation/Scheduling Period 

[0104] In the space-core based switch architecture,
the scheduling period is arbitrary and is selected to be 
long enough to enable the scheduler to complete the 
necessary computations. 

[0105] In the rotator-based switch architecture, 
sending traffic data every J > 1 rotator cycles increases 
the duration of the scheduling period. With J = 4, for 
example, and a rotator cycle of 64 microseconds, the 
ingress/egress transfer allocations are modified every 
256 microseconds. The traffic data sent is treated as 
though the traffic load were static for J consecutive 
cycles. With a relatively small J, of 2 to 4 for example, 
the effect on throughput is generally insignificant. 

Segmentation Efficiency 

[0106] The segmentation efficiency is the ratio of 
the total size, determined during an observation inter- 
val, of incoming packets to the total size of the seg- 
mented packets, determined during the same 
observation interval. The respective sizes are 
expressed in arbitrary units, bytes for example. The seg- 
mentation efficiency increases as the segment size 
decreases. Increasing the segment size increases the 
overall node capacity but reduces the segmentation
efficiency. Segmentation efficiency is preferably taken into
account when selecting the segment size.

Dynamic Computation of the Segmentation Efficiency

[0107] The admission control mechanism in the
switch (not illustrated), whether based on declared,
estimated, predicted, or adaptive service-rates, provides a
permissible data load for each ingress/egress pair per
rotator cycle. This is based on the traffic load in
bytes and does not take into account the effect of packet
segmentation and the resulting waste induced by null
padding of the last segment of most packets. The
transfer allocation mechanism, however, allocates segments,
not bytes, and a correction factor is therefore required to
compensate for the segmentation waste. The correction
may be based on a global estimate of the waste.
Preferably, it is determined dynamically for each
ingress/egress pair and used to govern the admission
control process. The computation of packet sizes before
and after segmentation is a straightforward process that
is understood by persons skilled in the art of packet
switching.
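
By way of illustration only, the segmentation efficiency
defined above (total size of incoming packets divided by the
total size of the segmented packets) may be computed as in
the following Python sketch. The function name, the packet
sizes, and the segment sizes are hypothetical, and segment
headers are ignored.

```python
def segmentation_efficiency(packet_sizes, segment_size):
    """Ratio of total packet bytes to total bytes after null-padded segmentation."""
    payload = sum(packet_sizes)
    # Each packet occupies a whole number of segments (ceiling division).
    segments = sum(-(-size // segment_size) for size in packet_sizes)
    return payload / (segments * segment_size)


packets = [40, 576, 1500]                 # hypothetical packet sizes in bytes
eff_64 = segmentation_efficiency(packets, 64)
eff_256 = segmentation_efficiency(packets, 256)
```

For these sample packets the efficiency is 2116/2176 with
64-byte segments and 2116/2560 with 256-byte segments, in
line with the observation that a larger segment size reduces
the segmentation efficiency. The reciprocal of the measured
efficiency is one possible form of the correction factor.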

Modified Transfer Allocations 

[0108] The ingress/egress transfer allocation must
take into account the effect of null padding of the last
segments of packets to be transferred. A simple
approach is to modify the required true transfer
allocation by a ratio that is slightly greater than 1 + 1/(2P),
where P is the mean number of segments per packet; P
is generally a non-integer number. The transfer
from the egress module to the egress links (to network
destinations) is based on the true allocated rate for a
connection, since the null-padding is removed when the
packet is presented in a serial bit stream before
emission from the switch.
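
By way of illustration only, the adjustment may be
sketched in Python as follows. The function name, the margin
parameter standing for "slightly greater", and the sample
values are hypothetical. One way to read the ratio is that
the last segment of a packet is, on average, about half
empty, so that P segments of data occupy roughly P + 1/2
segments of capacity.

```python
def padded_allocation(true_allocation, mean_segments_per_packet, margin=0.0):
    """Scale a true transfer allocation by 1 + 1/(2P) plus an optional margin."""
    factor = 1.0 + 1.0 / (2.0 * mean_segments_per_packet) + margin
    return true_allocation * factor


# Hypothetical case: 100 units of true allocation, P = 4 segments per packet.
modified = padded_allocation(100, 4)
```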

Aggregation Efficiency 

[0109] The aggregation efficiency η is a ratio of the
mean number of segments per parcel to the capacity, in
segments, of a parcel. The aggregation efficiency
increases with (1) an increase in the
threshold T; (2) a decrease in the maximum number of
segments per parcel; and (3) a decrease in the number
of ingress (egress) modules N. The worst-case
aggregation efficiency is computed using the formula:

$\eta = 1 - (m^{-1} - K^{-1})(N - 1)/T$

where:

N is the number of ingress modules (or egress
modules);
m is the largest number of parcels to be transferred
during an access interval;
K is the largest number of segments to be
transferred during an access interval (K = m × q, q
being the number of segments per parcel); and
the threshold T, expressed as an integer multiple
of the access interval, is larger than N.
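
By way of illustration only, the worst-case formula may be
evaluated with the following Python sketch. The function name
and the sample parameters are hypothetical, and the formula
is used in the form η = 1 − (1/m − 1/K)(N − 1)/T.

```python
def worst_case_aggregation_efficiency(m, K, N, T):
    """Worst-case aggregation efficiency for m parcels and K segments per
    access interval, N modules, and threshold T (in access intervals)."""
    if T <= N:
        raise ValueError("the threshold T is expected to be larger than N")
    return 1.0 - (1.0 / m - 1.0 / K) * (N - 1) / T


# Hypothetical case: m = 8 parcels, q = 4 segments per parcel (K = 32),
# N = 129 modules, threshold T = 1024 access intervals.
eta = worst_case_aggregation_efficiency(m=8, K=32, N=129, T=1024)
```

With one segment per parcel (K = m) the expression reduces
to unity, as expected when no aggregation takes place.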

[0110] The worst-case aggregation efficiency thus
calculated may be used in determining the required
internal expansion of the switch and in modifying the
transfer allocations.
[0111] It will, of course, be appreciated that the
above description has been given by way of example
only and that modifications may be made within the
scope of the present invention. The operating
methodology could, for example, be supported in the form of
computer code, such as on a CD-ROM, or could
otherwise be downloaded to a suitable physical architecture.

Claims 

1. A method of reciprocal traffic control in a switching
node for use in a data packet network, the switching
node including N ingress modules and M egress
modules, N and M being integers greater than one,
and a switching core adapted to permit packets to
be transferred from any one of the ingress modules
to any one of the egress modules, wherein the data
packets are sorted into ingress buffers at the
ingress modules so that the packets are arranged in
egress module order, CHARACTERIZED by:

associating a label with each packet to permit 
an ingress module at which the packet was 
received to be identified prior to sorting in the 
ingress modules; and 

sorting data packets into egress buffers at the 
egress modules using the label to determine a 
sort order of the data packets in the egress 
buffers. 

2. The method as claimed in claim 1 wherein the
packets are segmented into packet segments of a
predetermined length in the ingress modules for
transfer across a fabric of the switching node and
the label is associated with each packet segment.

3. The method as claimed in claim 2 wherein the label
comprises a first field that stores a unique identifier
that may be used to determine the sort order, a
second field to indicate whether the packet segment is
a last segment in the data packet, and a third field
containing the packet data.

4. The method as claimed in claim 3 wherein the third
field is padded with null data when the packet
segment is a last segment of a data packet and the last
segment is shorter than the third field length.

5. The method as claimed in claims 3 or 4 wherein
when the second field indicates that the packet
segment is a null-padded last segment in the data
packet, a last S bits of the third field indicate a
length of the packet data in the third field, the
integer S being greater than or equal to the base-2
logarithm of a length in bytes of the third field.

6. The method as claimed in any preceding claim
wherein control at each ingress module effects 
transfer rate regulation across a fabric of the switch- 
ing node to each egress module as a result of traffic 
data sent from the ingress module to an allocations 
mechanism that controls transfer rate regulation 
across the core of the switching node. 

7. The method as claimed in any preceding claim
wherein the N buffers at an egress module are 
assigned different egress priorities. 

8. The method as claimed in any preceding claim
wherein the packet segments are further sorted 
according to class of service designation at the 
ingress module, the egress module or both the 
ingress and the egress modules. 

9. A method as claimed in any preceding claim
wherein prior to sorting the data segments received 
at each egress module according to their labels, the 
following steps are performed: 

dividing each packet at ingress into a number 
of the segments; 

prefixing each segment with a header that 
includes a label that is uniquely associated with 
an ingress module that received the packet; 
identifying a last segment in a packet using a 
last segment indicator in the header; 
padding the last segment with null data if the 
payload data in the last segment is shorter than 
the predetermined length; and 
transferring the segments through the core of 
the switching node to the respective egress 
modules from which they must egress from the 
switching node. 

10. The method as claimed in claim 9 wherein the
segments are transferred from the ingress module to
the egress module under a transfer rate control. 

11. The method as claimed in claim 10 wherein the 
transfer rate from each ingress module to each 
egress module is explicitly specified by an admis- 
sion-control mechanism. 

12. The method as claimed in claim 10 wherein the 
transfer rate from each ingress module to each 
egress module is determined adaptively according
to an occupancy of each of the ingress buffers. 



13. The method as claimed in claim 10 wherein the 
transfer rate is adjusted by a factor determined 
dynamically for each ingress/egress pair according 
to a ratio of an aggregation of received packet size 
to an aggregation of a segmented packet size, the 
ratio being computed at predetermined intervals. 

14. A method as claimed in any preceding claim 
wherein the data packet traffic belongs to multiple 
classes of service, and the method further com- 
prises steps of: 

sorting the data packet traffic at each ingress 
module into traffic streams according to egress 
module from which the respective data packets 
must egress from the switching node, and by 
class of service; 

determining an aggregate committed transfer 
rate for each traffic stream; 
computing an aggregate committed transfer
rate for each of the traffic streams destined to a
same egress module;

determining traffic loads by counting the 
number of packet segments in each of the traf- 
fic streams; 

communicating the aggregate committed 
transfer rate and the traffic loads to a capacity 
transfer allocation mechanism; and 
distributing at the transfer allocation mechanism
any switching core capacity that exceeds
the aggregate committed transfer rate among
ingress/egress pairs based on traffic loads.
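
The per-egress aggregation in claim 14 might look like the sketch below, as seen from one ingress module. The record layout `Stream` and the function name `summarize_for_allocator` are assumptions for illustration. The point of the sketch is that the class of service is used for sorting at the ingress module but is dropped before the totals are communicated to the allocation mechanism.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (egress module, class of service, committed rate, waiting segments)
Stream = Tuple[int, int, float, int]


def summarize_for_allocator(
    streams: Iterable[Stream],
) -> Tuple[Dict[int, float], Dict[int, int]]:
    """Aggregate committed rates and segment counts per egress module."""
    committed = defaultdict(float)
    load = defaultdict(int)
    for egress, _class_of_service, rate, segments in streams:
        committed[egress] += rate   # aggregate committed rate toward this egress
        load[egress] += segments    # traffic load counted in packet segments
    return dict(committed), dict(load)
```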

15. The method as claimed in claim 14 wherein the
number of classes of service may be specified for 
each ingress module. 

16. A method as claimed in any preceding claim
wherein unused core capacity in the data packet 
switch having N ingress modules and N egress 
modules, is distributed to the ingress modules by: 

storing unused ingress capacity in N elements 
of an array X; 

storing unused egress capacity in N elements 
of an array Y; 

determining a total unused capacity by sum- 
ming the N elements of a one of the array X 
and array Y; and 

computing a transfer allocation of the total
unused capacity for each ingress/egress pair
by multiplying the respective unused ingress
capacities by the respective unused egress
capacities to obtain N² products, and dividing
each of the N² products by the total unused
capacity.
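
The product rule of claim 16 can be written out as a minimal sketch. The function name is hypothetical, and the sketch assumes that the elements of X and Y sum to the same total, which is what allows the claim to sum either array.

```python
from typing import List, Sequence


def proportional_allocation(x: Sequence[float], y: Sequence[float]) -> List[List[float]]:
    """Allocation for pair (i, j) is x[i] * y[j] / total unused capacity."""
    total = sum(x)  # assumed equal to sum(y)
    if total == 0:
        return [[0.0 for _ in y] for _ in x]
    return [[xi * yj / total for yj in y] for xi in x]
```

Under that assumption, each row of the result sums to the unused capacity of the corresponding ingress module and each column sums to the unused capacity of the corresponding egress module, so no module is allocated more than it has left.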

17. The method as claimed in claim 16 wherein prior to
computing a transfer allocation, either of the
multiplicand arrays, X or Y, is modified by dividing each
of its elements by a total unused capacity and a
transfer allocation for an ingress/egress pair is
computed by multiplying the respective elements in
array X by the elements in array Y.

18. A method as claimed in claim 17 wherein each
element in the modified multiplicand array is
left-shifted by an integer B bits before the division and
the result of pair-wise multiplication of elements of
the modified array and the other array is
right-shifted B bits, the integer B being an integer greater
than 8.
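
The shift-based arithmetic of claims 17 and 18 can be sketched with integer operations. The choice of B = 16, the decision to modify array X rather than Y, and the function name are assumptions for illustration; the claim requires only that B be greater than 8.

```python
from typing import List, Sequence

B = 16  # assumed shift width; the claim only requires B > 8


def fixed_point_allocation(x: Sequence[int], y: Sequence[int], b: int = B) -> List[List[int]]:
    """Integer-only version of x[i] * y[j] / total using shifts by b bits."""
    total = sum(x)
    if total == 0:
        return [[0 for _ in y] for _ in x]
    # Pre-divide one multiplicand array once: N divisions instead of N * N.
    scaled = [(xi << b) // total for xi in x]
    # Each pairwise product is then scaled back with a right shift.
    return [[(sx * yj) >> b for yj in y] for sx in scaled]
```

Because both the division and the right shift truncate, each result is at most the exact quotient; a larger B reduces the truncation error. The N² products are independent of one another, which is what makes parallel multiplication possible.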

19. The method as claimed in claim 18 wherein multiplication
processes are carried out in parallel.

20. A method as claimed in any preceding claim
wherein capacity in the data packet switch is shared
by:

(a) sending from each ingress module to a
transfer allocation mechanism, a committed-capacity
matrix, each element in the committed-capacity
matrix containing the committed
capacity of each ingress module with respect to
each of the egress modules;

(b) sending from each ingress module to the
transfer allocation mechanism a matrix storing
a number of traffic units waiting to be transferred
from the ingress module to each of the
egress modules;

(c) creating at the transfer allocation mechanism
a base matrix, each entry in the base
matrix being a lesser of corresponding entries
in the matrix containing the committed capacity
and the matrix containing the traffic units waiting
to be transferred;

(d) subtracting entries in the base matrix from
corresponding entries in the matrix containing
the traffic units to create an unassigned traffic
matrix;

(e) computing an unused capacity for each
ingress module and each egress module;

(f) simultaneously processing the N entries in a
diagonal set of the unassigned traffic matrix;

(g) for each ingress/egress pair belonging to a
diagonal set, determining an additional
ingress/egress transfer allocation on a basis of
the least one of an unused capacity of an
ingress module of the ingress/egress pair, an
unused capacity of an egress module of the
ingress/egress pair, and a corresponding
ingress/egress entry in the unassigned traffic
matrix;

(h) if the additional ingress/egress transfer allocation
is greater than zero, subtracting its value
from the unused capacity entry at ingress, the
unused capacity entry at egress, and the
ingress/egress entry in the waiting traffic
matrix;

(i) repeating steps (f) to (h) until all diagonals 
are processed; 

(j) repeating steps (a) to (i) each transfer allo- 
cation period; and 

(k) selecting a different order of diagonal 
processing each transfer allocation period. 
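
Steps (c) to (k) of claim 20 can be sketched for one transfer allocation period as follows. Several details are assumptions made for the example rather than statements from the claim: per-module capacities are passed in as `ingress_cap` and `egress_cap`, unused capacity is taken as capacity minus the base allocations, diagonal d is taken to be the set of pairs (i, (i + d) mod N), and the order of diagonals is varied by rotating the starting diagonal through a `start` argument.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def allocate_period(committed: Matrix, waiting: Matrix,
                    ingress_cap: List[int], egress_cap: List[int],
                    start: int = 0) -> Tuple[Matrix, Matrix]:
    """Return (base, extra) allocation matrices for one allocation period."""
    n = len(committed)
    # (c) base allocation: the lesser of committed capacity and waiting traffic
    base = [[min(committed[i][j], waiting[i][j]) for j in range(n)] for i in range(n)]
    # (d) traffic not covered by the base allocation
    unassigned = [[waiting[i][j] - base[i][j] for j in range(n)] for i in range(n)]
    # (e) unused capacity per ingress and per egress module (assumed formula)
    unused_in = [ingress_cap[i] - sum(base[i]) for i in range(n)]
    unused_out = [egress_cap[j] - sum(base[i][j] for i in range(n)) for j in range(n)]
    extra = [[0] * n for _ in range(n)]
    for k in range(n):                      # (i) visit every diagonal
        d = (start + k) % n                 # (k) caller changes `start` each period
        for i in range(n):                  # (f) entries of one diagonal share no row
            j = (i + d) % n                 #     or column, so they are independent
            grant = min(unused_in[i], unused_out[j], unassigned[i][j])  # (g)
            if grant > 0:                   # (h) charge the grant against all three
                unused_in[i] -= grant
                unused_out[j] -= grant
                unassigned[i][j] -= grant
                extra[i][j] = grant
    return base, extra
```

The inner loop is written sequentially here; because the N pairs of a diagonal touch distinct ingress and egress modules, it could be evaluated in parallel without changing the result.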

21. A method as claimed in any preceding claim
wherein the segments in an ingress module are 
aggregated into parcels, each parcel containing at 
least one and at most a predetermined integer 
number of the segments, each segment including a 
unique identifier associated with the ingress mod- 
ule, the segments being aggregated by: 

sorting the segments into logical buffers in a 
payload memory of the ingress module using 
an identifier associated with an egress module 
from which the respective segments must 
egress from the switch, and a delay tolerance 
identifier to determine a sort order for the seg- 
ments; 

increasing by 1 a count of a number of waiting 
segments in each respective buffer each time a 
new segment is added to a one of the buffers, 
the count being maintained in a waiting seg- 
ment count array; 

declaring the new segment a critical segment if 
a respective count of the number of segments 
in a logical buffer is equal to the predetermined 
maximum number of segments in a parcel;
initializing a respective entry in a timer array if 
the new segment is a critical segment; 
logically transferring segments from the pay- 
load memory to a ready queue associated with 
each egress module when a waiting segment 
count is greater than the predetermined 
number, or a respective entry in the waiting 
segment count array is greater than zero and a 
corresponding entry in the timer array exceeds 
a predetermined time limit associated with the 
logical buffer and increasing a ready-queue 
counter and decreasing a respective entry in 
the waiting segment count array accordingly; 
transferring segments from the ready queue in 
parcels to a next stage of the data packet 
switch, each parcel being padded with null data 
if the number of waiting segments in the ready- 
queue is greater than zero but less than the 
predetermined number and decreasing the 
count array by the number of segments in the
parcel after each parcel is transferred out of the 
ready queue. 
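
The size-or-timeout behaviour described in claim 21 can be illustrated with a simplified sketch. It is not a literal rendering of the claim: instead of separate count and timer arrays and a ready queue, each logical buffer stores arrival times with its segments, and the class name, method names and `None` padding marker are all assumptions for the example.

```python
from collections import defaultdict, deque
from typing import Any, Dict, List, Tuple


class ParcelAggregator:
    """Forms parcels per (egress module, delay tolerance) logical buffer."""

    def __init__(self, parcel_size: int, time_limits: Dict[int, float]):
        self.parcel_size = parcel_size
        self.time_limits = time_limits        # delay tolerance identifier -> time limit
        self.buffers = defaultdict(deque)     # (egress, delay class) -> (arrival, segment)

    def add(self, egress: int, delay_class: int, segment: Any, now: float) -> None:
        self.buffers[(egress, delay_class)].append((now, segment))

    def collect(self, now: float) -> List[Tuple[int, List[Any]]]:
        """Return the parcels that are due at time `now` as (egress, segments)."""
        parcels = []
        for (egress, delay_class), buf in self.buffers.items():
            # Full parcels leave as soon as enough segments are waiting.
            while len(buf) >= self.parcel_size:
                parcels.append((egress, [buf.popleft()[1] for _ in range(self.parcel_size)]))
            # A partly filled parcel leaves once its oldest segment has waited too long.
            if buf and now - buf[0][0] > self.time_limits[delay_class]:
                segments = [segment for _, segment in buf]
                buf.clear()
                padding = [None] * (self.parcel_size - len(segments))  # null padding
                parcels.append((egress, segments + padding))
        return parcels
```

The trade-off shown is the one the claim describes: waiting for a full parcel uses the core efficiently, while the per-buffer time limit bounds the delay added to traffic with a low delay tolerance.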

22. The method as claimed in claim 21 wherein every
entry in the waiting segment count array and every
entry in the timer array is examined within a time
interval sufficient to transfer a parcel from the
ready-queue.

23. The method as claimed in claim 22 wherein the 
number of delay tolerance identifiers may vary from 
one ingress module to another ingress module.

24. A switching node for switching data packets having
a plurality of ingress modules each including a segmentation
mechanism for deconstructing the packets
into segments of a predetermined length at
ingress, storing the segments in buffers and sorting
the segments, and a plurality of egress modules
and a switch core interconnecting the ingress modules
and the egress modules, CHARACTERIZED
by:




a selector for selecting which of the buffered
segments stored in a given one of the ingress
modules to transfer to the switch core according
to the traffic class of service property; and
a packet assembly mechanism for reconstructing
each packet at egress so that each packet
is transferred from the switching node in a format
in which it was received at the ingress
module.

25. The switching node as claimed in claim 24 wherein 
the switch core comprises a space-switched core. 

26. The switching node as claimed in claim 24 wherein
the switch core comprises a bank of memories
interposed between two rotators.

27. A switching node as claimed in any one of claims
24-26, further comprising:

N ingress modules and M egress modules, N
and M being integers greater than one;
an ingress/egress transfer allocation mechanism
which periodically receives data related to
data packet traffic to be transferred from each
of the respective N ingress modules; and
an ingress/egress scheduling mechanism
which uses data generated by the
ingress/egress transfer allocation mechanism
to generate a transfer schedule for each of the
respective N ingress modules.

28. A switching node as claimed in claim 27 wherein
the transfer schedules are periodically communicated
to the respective ingress modules.

29. A switching node as claimed in claim 27 wherein
the transfer schedules are generated from an
ingress/egress transfer allocation matrix and a
transfer schedule time frame used for generating
the transfer schedule, which is of [...]

30. A switching node as claimed in any one of claims
27-29 wherein the transfer allocation mechanism
operates to allocate a capacity of the switching core
so that the ingress and egress modules dynamically
share an available capacity of the switching [...]

31. A switching node as claimed in claim 24 wherein
the switching node has a rotator-based switch
architecture, the transfer allocation mechanism
sends transfer rate allocations directly to each of
the N ingress modules, and the respective ingress
modules perform distributed spatial matching to
achieve local scheduling for the transfer of packet
segments to the switching core.

32. A switching node as claimed in any one of claims 
24-31 wherein a parcel scheduler uses a core 
memory logically partitioned into N sections, each 
section corresponding to a one of the egress mod- 
ules, and each section is capable of storing m data 
parcels, a transfer allocation mechanism determin- 
ing a number of parcels eligible for transfer from 
each ingress module to each egress module during 
a scheduling cycle, the parcel scheduler comprising:

a bank of N egress-state memories, each
egress-state memory being logically divided
into N sections, each section being adapted to
store a number representative of a number of
parcels that can be accommodated in a corresponding
section in the core memory;
a bank of N transfer allocation memories, each
transfer allocation memory having N entries,
each entry corresponding to an ingress/egress
module pair and adapted to store a number
representative of an eligible parcel transfer
allocation for the ingress/egress pair;
an NxN rotator to cyclically pair the transfer 
allocation memories and the egress-state 
memories; 

a bank of N matching circuits, each circuit 
determining a lesser of a value stored in a 
transfer allocation memory and a value stored 
in a corresponding entry in the egress-state 
memory; 

a bank of N result memories, each result memory
corresponding to an ingress module and
adapted to store a sequence of egress module
numbers, each egress module number representing
a parcel eligible for transfer to the
egress module;

means for subtracting one from the entries of
the egress-state memories and the transfer
allocation memories for each egress module
number in the result memories; and
means for transferring the contents of each
result memory to its corresponding ingress
module.

36. A computer program element comprising computer
[...] to make a controller [...]
perform the method steps of any
of claims 1 to 23.

37. The computer program element of claim 36,
embodied on a computer readable medium.

[...] matching simultaneously corresponding elements
of the transfer allocation memories and
egress-state memories to select up to m
packet parcels for transfer; and

[...]ing an egress interval.

[...] parcel scheduler comprising:

[...] egress-state memories, each [...]

[...] allocation memories and the egress-state
memories; [...] each circuit [...]